BUG: create_service fails inside Snowflake Notebooks (SPCS Runtime) #229

@lawrenceadams

create_service fails inside Snowflake Notebooks: _check_if_service_exists only catches SnowparkSQLException, not the connector ProgrammingError raised by async collect()

Description

When deploying a model to SPCS with ModelVersion.create_service(...) from inside a Snowflake Notebook (Container Runtime), the call fails on a first-time deploy with:

ProgrammingError: 002003: SQL compilation error:
Service '<DB>.<SCHEMA>.<SERVICE>' does not exist or not authorized.

This happens during the pre-flight existence probe, before the service is created — i.e. it fails precisely because the service does not exist yet, which is the normal first-deploy case the code is meant to handle gracefully.

The same code path works correctly from a plain Snowpark session (local Python / SnowCLI). The failure is specific to the Snowflake Notebook runtime.
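One way to tell which of the two runtimes a session is in is to look at where the function behind DataFrame.collect is actually defined. The following is a minimal sketch, assuming the notebook runtime swaps in a plain Python function (as the _cancellable_collect frame in the traceback below suggests). defining_file is a hypothetical helper name, not part of any Snowflake package.

```python
from typing import Callable, Optional


def defining_file(func: Callable) -> Optional[str]:
    """Return the source file of the code object behind func, if it has one."""
    # __code__ belongs to the function that actually runs, so a wrapper that
    # copies __name__/__module__ from the original still reports its own file.
    code = getattr(func, "__code__", None)
    return code.co_filename if code is not None else None


def sample() -> None:
    """Local function used only to demonstrate the helper."""


print(defining_file(sample))  # path of the file (or cell) that defines sample
print(defining_file(len))     # None: builtins have no Python code object
```

In a notebook, defining_file(snowflake.snowpark.DataFrame.collect) would be expected to return a path under snowflake_notebook_utils if the patch is active, and a path inside the snowflake.snowpark package (possibly a decorator module) otherwise. That expectation is inferred from the traceback and has not been verified.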

Environment

  • snowflake-ml-python: 1.39.0
  • Runtime: Snowflake Notebooks on Container Runtime
  • Model type: Hugging Face transformers text-generation pipeline logged with OPENAI_CHAT_SIGNATURE
  • Deploy target: GPU compute pool, inference_engine_options={"engine": InferenceEngine.VLLM, ...}

Steps to reproduce

  1. Log a model version to the registry (any model is sufficient to trigger the probe).
  2. From a Snowflake Notebook cell, call create_service for a service name that does not already exist:
mv = reg.get_model("<MODEL>").version("V1_0_0")
mv.create_service(
    service_name="<SERVICE>",
    service_compute_pool="<GPU_POOL>",
    gpu_requests="1",
    max_instances=1,
    ingress_enabled=False,
)
  3. The call uploads deployment artifacts, then raises ProgrammingError: 002003 ... does not exist or not authorized instead of proceeding to create the service.

Abridged traceback

mv.create_service(...)
  -> ServiceOperator.create_service (service_ops.py)
     -> _check_if_service_exists (service_ops.py)
        -> ServiceSQLClient.get_service_container_statuses (sql/service.py)
           -> session.sql("SHOW SERVICE CONTAINERS IN SERVICE <DB>.<SCHEMA>.<SERVICE>").collect()
              -> snowflake_notebook_utils/session_bootstrap.py:_cancellable_collect
                 -> async_job.result(result_type="row")
                    -> cursor.get_results_from_sfqid
                       -> connection.get_query_status_throw_if_error
                          -> raises snowflake.connector.errors.ProgrammingError (002003)

Full traceback

ProgrammingError: 002003: SQL compilation error:
Service 'CUSTOM_MODEL.MODELS.CUSTOM_MODEL_SVC' does not exist or not authorized.
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
Cell In[47], line 1
----> 1 mod.create_service(
      2     service_name="CUSTOM_MODEL.MODELS.CUSTOM_MODEL_SVC",
      3     service_compute_pool="GPU_MODEL_ENDPOINT",
      4     gpu_requests="1",

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/_internal/telemetry.py:611, in send_api_usage_telemetry.<locals>.decorator.<locals>.wrap(*args, **kwargs)
    602 telemetry_args = dict(
    603     func_name=_get_full_func_name(func),
    604     function_category=TelemetryField.FUNC_CAT_USAGE.value,
   (...)    608     custom_tags=final_custom_tags,
    609 )
    610 try:
--> 611     return ctx.run(execute_func_with_statement_params)
    612 except Exception as e:
    613     if not isinstance(e, snowml_exceptions.SnowflakeMLException):
    614         # already handled via a nested decorated function

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/_internal/telemetry.py:576, in send_api_usage_telemetry.<locals>.decorator.<locals>.wrap.<locals>.execute_func_with_statement_params()
    574 def execute_func_with_statement_params() -> _ReturnValue:
    575     _patch_manager.set_statement_params(statement_params)
--> 576     result = func(*args, **kwargs)
    577     return update_stmt_params_if_snowpark_df(result, statement_params)

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/model/_client/model/model_version_impl.py:1492, in ModelVersion.create_service(self, service_name, image_build_compute_pool, service_compute_pool, image_repo, ingress_enabled, min_instances, max_instances, cpu_requests, memory_requests, gpu_requests, num_workers, max_batch_rows, force_rebuild, build_external_access_integration, build_external_access_integrations, block, autocapture, inference_engine_options, experimental_options)
   1490 with model_event_handler.status("Creating model inference service", total=6, block=block) as status:
   1491     try:
-> 1492         result = self._service_ops.create_service(
   1493             database_name=None,
   1494             schema_name=None,
   1495             model_name=self._model_name,
   1496             version_name=self._version_name,
   1497             service_database_name=service_db_id,
   1498             service_schema_name=service_schema_id,
   1499             service_name=service_id,
   1500             image_build_compute_pool_name=(
   1501                 sql_identifier.SqlIdentifier(image_build_compute_pool)
   1502                 if image_build_compute_pool
   1503                 else sql_identifier.SqlIdentifier(service_compute_pool)
   1504             ),
   1505             service_compute_pool_name=sql_identifier.SqlIdentifier(service_compute_pool),
   1506             image_repo_name=image_repo,
   1507             ingress_enabled=ingress_enabled,
   1508             min_instances=min_instances,
   1509             max_instances=max_instances,
   1510             cpu_requests=cpu_requests,
   1511             memory_requests=memory_requests,
   1512             gpu_requests=gpu_requests,
   1513             num_workers=num_workers,
   1514             max_batch_rows=max_batch_rows,
   1515             force_rebuild=force_rebuild,
   1516             build_external_access_integrations=(
   1517                 None
   1518                 if build_external_access_integrations is None
   1519                 else [sql_identifier.SqlIdentifier(eai) for eai in build_external_access_integrations]
   1520             ),
   1521             block=block,
   1522             statement_params=statement_params,
   1523             progress_status=status,
   1524             inference_engine_args=inference_engine_args,
   1525             autocapture=autocapture,
   1526         )
   1527         status.update(label="Model service created successfully", state="complete", expanded=False)
   1528         return result

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/model/_client/ops/service_ops.py:303, in ServiceOperator.create_service(self, database_name, schema_name, model_name, version_name, service_database_name, service_schema_name, service_name, image_build_compute_pool_name, service_compute_pool_name, image_repo_name, ingress_enabled, min_instances, max_instances, cpu_requests, memory_requests, gpu_requests, num_workers, max_batch_rows, force_rebuild, build_external_access_integrations, block, progress_status, statement_params, hf_model_args, inference_engine_args, autocapture)
    295     file_utils.upload_directory_to_stage(
    296         self._session,
    297         local_path=pathlib.Path(self._workspace.name),
    298         stage_path=pathlib.PurePosixPath(stage_path),
    299         statement_params=statement_params,
    300     )
    302 # check if the inference service is already running/suspended
--> 303 model_inference_service_exists = self._check_if_service_exists(
    304     database_name=service_database_name,
    305     schema_name=service_schema_name,
    306     service_name=service_name,
    307     service_status_list_if_exists=[
    308         service_sql.ServiceStatus.RUNNING,
    309         service_sql.ServiceStatus.SUSPENDING,
    310         service_sql.ServiceStatus.SUSPENDED,
    311     ],
    312     statement_params=statement_params,
    313 )
    315 # Step 3: Initiating model deployment
    316 progress_status.update("initiating model deployment...")

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/model/_client/ops/service_ops.py:937, in ServiceOperator._check_if_service_exists(self, database_name, schema_name, service_name, service_status_list_if_exists, statement_params)
    928     service_status_list_if_exists = [
    929         service_sql.ServiceStatus.PENDING,
    930         service_sql.ServiceStatus.RUNNING,
   (...)    934         service_sql.ServiceStatus.FAILED,
    935     ]
    936 try:
--> 937     statuses = self._service_client.get_service_container_statuses(
    938         database_name=database_name,
    939         schema_name=schema_name,
    940         service_name=service_name,
    941         include_message=False,
    942         statement_params=statement_params,
    943     )
    944     service_status = statuses[0].service_status
    945     return any(service_status == status for status in service_status_list_if_exists)

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/model/_client/sql/service.py:246, in ServiceSQLClient.get_service_container_statuses(self, database_name, schema_name, service_name, include_message, statement_params)
    237 fully_qualified_object_name = self.fully_qualified_object_name(database_name, schema_name, service_name)
    238 query = f"SHOW SERVICE CONTAINERS IN SERVICE {fully_qualified_object_name}"
    239 rows = (
    240     query_result_checker.SqlResultValidator(self._session, query, statement_params=statement_params)
    241     .has_column(ServiceSQLClient.INSTANCE_STATUS)
    242     .has_column(ServiceSQLClient.CONTAINER_STATUS)
    243     .has_column(ServiceSQLClient.SERVICE_STATUS)
    244     .has_column(ServiceSQLClient.INSTANCE_ID)
    245     .has_column(ServiceSQLClient.MESSAGE)
--> 246     .validate()
    247 )
    248 statuses = []
    249 for r in rows:

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/_internal/utils/query_result_checker.py:232, in ResultValidator.validate(self)
    226 def validate(self) -> list[snowpark.Row]:
    227     """Execute the query and validate the result.
    228 
    229     Returns:
    230         Query result.
    231     """
--> 232     result = self._get_result()
    233     for matcher in self._success_matchers:
    234         assert matcher(result, self._query)

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/ml/_internal/utils/query_result_checker.py:264, in SqlResultValidator._get_result(self)
    262 def _get_result(self) -> list[snowpark.Row]:
    263     """Collect the result of the given SQL query."""
--> 264     return self._session.sql(self._query).collect(statement_params=self._statement_params)

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake_notebook_utils/session_bootstrap.py:192, in SessionBootstrap.patch_dataframe_cancellation.<locals>._cancellable_collect(df_self, *args, **kwargs)
    188 query_id = async_job.query_id
    189 try:
    190     # Must be a value from snowflake.snowpark.async_job._AsyncResultType
    191     # ("row" maps to List[Row], matching DataFrame.collect()'s contract).
--> 192     return async_job.result(result_type="row")
    193 except BaseException:
    194     _cancel_snowflake_query(df_self._session._conn, query_id)

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/snowpark/async_job.py:399, in AsyncJob.result(self, result_type)
    365 """
    366 Blocks and waits until the query associated with this instance finishes, then returns query
    367 results. This acts like executing query in a synchronous way. The data type of returned
   (...)    394         the original result data type.
    395 """
    396 async_result_type = (
    397     _AsyncResultType(result_type.lower()) if result_type else self._result_type
    398 )
--> 399 self._cursor.get_results_from_sfqid(self.query_id)
    400 if self._num_statements is not None:
    401     for _ in range(self._num_statements - 1):

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/connector/cursor.py:1811, in SnowflakeCursorBase.get_results_from_sfqid(self, sfqid)
   1808     else:
   1809         return False
-> 1811 self.connection.get_query_status_throw_if_error(
   1812     sfqid
   1813 )  # Trigger an exception if query failed
   1814 self._inner_cursor = self.__class__(self.connection)
   1815 self._sfqid = sfqid

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/connector/connection.py:2485, in SnowflakeConnection.get_query_status_throw_if_error(self, sf_qid)
   2483 self._cache_query_status(sf_qid, status)
   2484 if self.is_an_error(status):
-> 2485     self._process_error_query_status(sf_qid, status_resp)
   2486 return status

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/connector/connection.py:2444, in SnowflakeConnection._process_error_query_status(self, sf_qid, status_resp, error_message, error_cls)
   2442     message += queries[0].get("errorMessage", "") if queries else ""
   2443     sql_state = data.get("sqlState")
-> 2444 Error.errorhandler_wrapper(
   2445     self,
   2446     None,
   2447     error_cls,
   2448     {
   2449         "msg": message or error_message,
   2450         "errno": int(code),
   2451         "sqlstate": sql_state,
   2452         "sfqid": sf_qid,
   2453     },
   2454 )

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/connector/errors.py:298, in Error.errorhandler_wrapper(connection, cursor, error_class, error_value)
    275 @staticmethod
    276 def errorhandler_wrapper(
    277     connection: SnowflakeConnection | None,
   (...)    280     error_value: dict[str, Any],
    281 ) -> None:
    282     """Error handler wrapper that calls the errorhandler method.
    283 
    284     Args:
   (...)    295         exception to the first handler in that order.
    296     """
--> 298     handed_over = Error.hand_to_other_handler(
    299         connection,
    300         cursor,
    301         error_class,
    302         error_value,
    303     )
    304     if not handed_over:
    305         raise Error.errorhandler_make_exception(
    306             error_class,
    307             error_value,
    308         )

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/connector/errors.py:361, in Error.hand_to_other_handler(connection, cursor, error_class, error_value)
    359 elif connection is not None:
    360     try:
--> 361         connection.errorhandler(connection, cursor, error_class, error_value)
    362     except NotImplementedError:
    363         # for async compatibility, check SNOW-1763096 and SNOW-1763103
    364         connection._errorhandler(connection, cursor, error_class, error_value)

File /opt/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/site-packages/snowflake/connector/errors.py:229, in Error.default_errorhandler(connection, cursor, error_class, error_value)
    227 errno = error_value.get("errno")
    228 done_format_msg = error_value.get("done_format_msg")
--> 229 raise error_class(
    230     msg=error_value.get("msg"),
    231     errno=None if errno is None else int(errno),
    232     sqlstate=error_value.get("sqlstate"),
    233     sfqid=error_value.get("sfqid"),
    234     query=error_value.get("query"),
    235     done_format_msg=(
    236         None if done_format_msg is None else bool(done_format_msg)
    237     ),
    238     connection=connection,
    239     cursor=cursor,
    240 )

ProgrammingError: 002003: SQL compilation error:
Service 'CUSTOM_MODEL.MODELS.CUSTOM_MODEL_SVC' does not exist or not authorized.

Root cause

ServiceOperator._check_if_service_exists (in snowflake/ml/model/_client/ops/service_ops.py) runs SHOW SERVICE CONTAINERS IN SERVICE ... and relies on the query throwing when the service does not exist, catching that to return False:

try:
    statuses = self._service_client.get_service_container_statuses(
        database_name=database_name,
        schema_name=schema_name,
        service_name=service_name,
        include_message=False,
        statement_params=statement_params,
    )
    service_status = statuses[0].service_status
    return any(service_status == status for status in service_status_list_if_exists)
except exceptions.SnowparkSQLException:
    return False

In a normal Snowpark session, the failing SHOW raises snowflake.snowpark.exceptions.SnowparkSQLException, which is caught and the method returns False (service not found → proceed to create).

Inside a Snowflake Notebook, the runtime patches DataFrame.collect() to execute asynchronously (snowflake_notebook_utils.session_bootstrap.SessionBootstrap.patch_dataframe_cancellation._cancellable_collect). On that async path the failure surfaces as a bare snowflake.connector.errors.ProgrammingError (errno 2003), not a SnowparkSQLException. The except clause does not match, the exception escapes _check_if_service_exists, and it aborts create_service.

So the existence-check helper that is supposed to tolerate "service does not exist" only tolerates it for one of the two exception types that can carry that condition, and the notebook runtime produces the uncaught one.
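The mismatch can be reproduced without Snowflake at all. The sketch below uses two stand-in exception classes (they only share names with the real ones and are unrelated to each other, which is the assumption being illustrated) and a simplified check_if_service_exists that mirrors the shape of the helper above. The probe functions are hypothetical.

```python
class SnowparkSQLException(Exception):
    """Stand-in for snowflake.snowpark.exceptions.SnowparkSQLException."""


class ProgrammingError(Exception):
    """Stand-in for snowflake.connector.errors.ProgrammingError."""


def check_if_service_exists(probe) -> bool:
    """Simplified mirror of the helper: only the Snowpark exception is tolerated."""
    try:
        statuses = probe()
        return bool(statuses)
    except SnowparkSQLException:
        return False


def sync_probe():
    # Plain Snowpark session: the failing SHOW surfaces as the Snowpark exception.
    raise SnowparkSQLException("Service 'X' does not exist or not authorized.")


def async_probe():
    # Notebook runtime: the same failure surfaces as the connector exception.
    raise ProgrammingError("Service 'X' does not exist or not authorized.")


print(check_if_service_exists(sync_probe))  # False

try:
    check_if_service_exists(async_probe)
except ProgrammingError as exc:
    print("escaped:", type(exc).__name__)  # escaped: ProgrammingError
```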

Query

Would it be possible to broaden the except in _check_if_service_exists to also catch the connector-level ProgrammingError for the not-found case? Ideally the handler would verify that it really is a "does not exist / not authorized" error (errno 2003) rather than swallowing all connector errors, so genuine failures aren't masked.
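As a rough illustration of the shape such a change could take, here is a sketch that runs on its own. Everything in it is a stand-in: the two Stub* exception classes replace the real Snowpark and connector exceptions, service_exists is a hypothetical simplification of _check_if_service_exists, and statuses are plain strings rather than the library's status objects. It is not a proposed patch against the actual source.

```python
from typing import Callable, Sequence

# errno carried by "002003 ... does not exist or not authorized" in the traceback.
NOT_FOUND_ERRNO = 2003


class StubSnowparkSQLException(Exception):
    """Stand-in for snowflake.snowpark.exceptions.SnowparkSQLException."""


class StubProgrammingError(Exception):
    """Stand-in for snowflake.connector.errors.ProgrammingError."""

    def __init__(self, msg: str, errno: int) -> None:
        super().__init__(msg)
        self.errno = errno


def service_exists(probe: Callable[[], Sequence[str]], accepted: Sequence[str]) -> bool:
    """Return True if the probe reports a service whose status is in accepted."""
    try:
        statuses = probe()
    except StubSnowparkSQLException:
        return False  # existing behaviour: the sync path means "not found"
    except StubProgrammingError as exc:
        if exc.errno == NOT_FOUND_ERRNO:
            return False  # async path, same "not found" condition
        raise  # any other connector error is a genuine failure
    # Unlike the original, an empty result is treated as "not found" here.
    return bool(statuses) and statuses[0] in accepted


def missing() -> Sequence[str]:
    raise StubProgrammingError("Service 'X' does not exist or not authorized.", 2003)


print(service_exists(missing, ["RUNNING"]))              # False
print(service_exists(lambda: ["RUNNING"], ["RUNNING"]))  # True
```

Checking the errno rather than the exception type alone keeps unrelated connector failures visible to the caller.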

Workaround

(Admittedly hacky) Monkeypatch _check_if_service_exists in the notebook before calling create_service:

from snowflake.ml.model._client.ops import service_ops
from snowflake.connector import errors as sf_errors

# Keep a reference to the original method so the wrapper can delegate to it.
_orig = service_ops.ServiceOperator._check_if_service_exists


def _patched(self, *args, **kwargs):
    try:
        return _orig(self, *args, **kwargs)
    except sf_errors.ProgrammingError as e:
        # Treat only the "does not exist or not authorized" case as "service not found".
        if getattr(e, "errno", None) == 2003 or "does not exist" in str(e):
            return False
        raise  # anything else is a genuine failure


service_ops.ServiceOperator._check_if_service_exists = _patched
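
If a permanent patch for the rest of the kernel session is undesirable, the same wrapper can be applied only around the deploy call with unittest.mock.patch.object, which restores the original attribute on exit. The sketch below shows the mechanism on stand-ins: StubOperator and StubProgrammingError are hypothetical placeholders for service_ops.ServiceOperator and the connector's ProgrammingError.

```python
from unittest import mock


class StubProgrammingError(Exception):
    """Stand-in for snowflake.connector.errors.ProgrammingError."""

    def __init__(self, msg: str, errno: int) -> None:
        super().__init__(msg)
        self.errno = errno


class StubOperator:
    """Stand-in for service_ops.ServiceOperator."""

    def _check_if_service_exists(self) -> bool:
        raise StubProgrammingError("Service 'X' does not exist or not authorized.", 2003)


_orig = StubOperator._check_if_service_exists


def _patched(self, *args, **kwargs):
    try:
        return _orig(self, *args, **kwargs)
    except StubProgrammingError as e:
        if e.errno == 2003:
            return False
        raise


# The patch is active only inside the with block.
with mock.patch.object(StubOperator, "_check_if_service_exists", _patched):
    inside = StubOperator()._check_if_service_exists()

# Outside the block the original method is back in place.
restored = StubOperator._check_if_service_exists is _orig
print(inside, restored)  # False True
```

In the notebook, the create_service call would go inside the with block in place of the stub call.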
