Standalone spark connect trace per session#11521
Conversation
…application spans
…rengthen overflow test
|
@codex review |
|
Hi! 👋 Thanks for your pull request! 🎉 To help us review it, please make sure to:
If you need help, please check our contributing guidelines. |
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e62cd6dbc1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (tag.startsWith("spark-connect-session-")) { | ||
| String id = tag.substring("spark-connect-session-".length()); |
There was a problem hiding this comment.
Recognize Spark's actual session job tag
In Spark Connect, jobs are tagged with the session tag generated by SparkSession.sessionJobTag, which is spark-session-$sessionUUID and is propagated through SQLExecution.withSessionTagsApplied; I could not find any Spark source path that emits spark-connect-session-. With this prefix check, real Connect jobs will return null here and continue to be parented under the global application span, so the new per-session trace behavior never activates outside these tests' synthetic properties.
Useful? React with 👍 / 👎.
…d of non-existent spark-connect-session-
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e460d84d3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } else if (connectSessionId != null) { | ||
| AgentSpan sessionAppSpan = | ||
| getOrCreatePerSessionApplicationSpan(connectSessionId, queryStart.time(), jobProperties); | ||
| spanBuilder.asChildOf(sessionAppSpan.context()); | ||
| } else if (isRunningOnDatabricks) { | ||
| addDatabricksSpecificTags(spanBuilder, jobProperties, true); |
There was a problem hiding this comment.
Preserve Databricks parenting for Connect SQL jobs
When a SQL job on a Databricks cluster also carries Spark Connect spark.jobTags, this new branch wins before isRunningOnDatabricks, so addDatabricksSpecificTags(...) is skipped for the SQL span; because onJobStart then parents the job under that SQL span via the sqlSpan != null branch, neither span gets the Databricks job/run/task tags or parent context that this listener previously added for Databricks jobs. This only affects Spark Connect SQL workloads running in a Databricks environment; keep the Databricks path ahead of the Connect-session path or apply the Databricks tags/parenting as well.
Useful? React with 👍 / 👎.
| sessionAppSpan.finish(time * 1000); | ||
| } | ||
| perSessionApplicationSpans.clear(); | ||
| perSessionApplicationMetrics.clear(); | ||
| perSessionLastJobFailed.clear(); | ||
| perSessionLastJobFailedMessage.clear(); | ||
| perSessionLastJobFailedStackTrace.clear(); |
There was a problem hiding this comment.
Flush finished session spans before the early return
For a pure Spark Connect server, the comment above notes that applicationSpan == null with jobCount > 0, so after these per-session application spans are finished the existing guard at line 410 returns before reaching the tracer.flush() used at application shutdown. Since these session root spans remain open until finishApplication, they can be left only in the writer queue when the JVM/SparkContext is stopping and may be dropped instead of being synchronously written; flush after finishing the per-session spans when this early-return path is taken.
Useful? React with 👍 / 👎.
What Does This Do
Emit one
spark.applicationtrace per Spark Connect session instead of a single global one for the whole server lifetime.When the listener detects a Connect job (via the
SparkConnect_OperationTag_..._Session_<id>_Operation_<id>tag onspark.jobTags), it parentsspark.sql/spark.jobspans under a per-sessionspark.applicationroot span (tagged withsession_idandspark.connect.server=true) rather than the global one. The per-session spans are kept in bounded maps and finished on server shutdown.Motivation
A Spark Connect server is long-lived and hosts many independent client sessions. Today every session's work rolls up into one
spark.applicationtrace, which is unusable for per-user / per-session analysis and grows without bound.https://datadoghq.atlassian.net/wiki/spaces/DNAP/pages/6756174072/Problem+Statement+Spark+Connect+and+Data+Jobs+Monitoring?atlOrigin=eyJpIjoiZmM0NmExYzI5OWRjNGVlMDk5NDg4NjExNDkzMDg5ZGIiLCJwIjoiY29uZmx1ZW5jZS1jaGF0cy1pbnQifQ
Additional Notes
finishApplicationso per-session spans are not lost on JVM shutdown.maxCollectionSize()cap used elsewhere in the listener; on overflow, jobs fall back to the globalspark.applicationspan.finishApplication: per-session spans are currently closed on server shutdown rather than on session close, because Spark Connect does not surface a session-close event throughSparkListener. To revisit when that hook becomes available.Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueJira ticket: [PROJ-IDENT]