Skip to content

Standalone spark connect trace per session#11521

Open
aboitreaud wants to merge 5 commits into
masterfrom
adrien.boitreaud/spark-connect-per-session-traces
Open

Standalone spark connect trace per session#11521
aboitreaud wants to merge 5 commits into
masterfrom
adrien.boitreaud/spark-connect-per-session-traces

Conversation

@aboitreaud
Copy link
Copy Markdown
Contributor

@aboitreaud aboitreaud commented Jun 1, 2026

What Does This Do

Emit one spark.application trace per Spark Connect session instead of a single global one for the whole server lifetime.

When the listener detects a Connect job (via the SparkConnect_OperationTag_..._Session_<id>_Operation_<id> tag on spark.jobTags), it parents spark.sql / spark.job spans under a per-session spark.application root span (tagged with session_id and spark.connect.server=true) rather than the global one. The per-session spans are kept in bounded maps and finished on server shutdown.

Motivation

A Spark Connect server is long-lived and hosts many independent client sessions. Today every session's work rolls up into one spark.application trace, which is unusable for per-user / per-session analysis and grows without bound.

https://datadoghq.atlassian.net/wiki/spaces/DNAP/pages/6756174072/Problem+Statement+Spark+Connect+and+Data+Jobs+Monitoring?atlOrigin=eyJpIjoiZmM0NmExYzI5OWRjNGVlMDk5NDg4NjExNDkzMDg5ZGIiLCJwIjoiY29uZmx1ZW5jZS1jaGF0cy1pbnQifQ

Additional Notes

  • Pure Connect servers (no global app span) now flush traces synchronously in finishApplication so per-session spans are not lost on JVM shutdown.
  • All per-session maps are guarded by the same maxCollectionSize() cap used elsewhere in the listener; on overflow, jobs fall back to the global spark.application span.
  • TODO left in finishApplication: per-session spans are currently closed on server shutdown rather than on session close, because Spark Connect does not surface a session-close event through SparkListener. To revisit when that hook becomes available.

Contributor Checklist

  • Format the title according to the contribution guidelines
  • Assign the type: and (comp: or inst:) labels in addition to any other useful labels
  • Avoid using close, fix, or any linking keywords when referencing an issue
    Use solves instead, and assign the PR milestone to the issue
  • Update the CODEOWNERS file on source file addition, migration, or deletion
  • Update public documentation with any new configuration flags or behaviors
  • Add your completed PR to the merge queue by commenting `/merge`. You can also:
    • Customize the commit message associated with the merge with `/merge --commit-message "..."`
    • Remove your PR from the merge queue with `/merge -c`
    • Skip all merge queue checks with `/merge -f --reason "reason"`; please use this judiciously, as some checks do not run at the PR-level
    • Get more information in this doc

Jira ticket: [PROJ-IDENT]

@aboitreaud aboitreaud requested a review from a team as a code owner June 1, 2026 15:32
@aboitreaud
Copy link
Copy Markdown
Contributor Author

@codex review

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented Jun 1, 2026

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

  • Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

@datadog-datadog-prod-us1-2
Copy link
Copy Markdown
Contributor

datadog-datadog-prod-us1-2 Bot commented Jun 1, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 8 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | test_inst: [17, 5/8]   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. 7 failed tests in ReactorNettyHttpClientTest due to SSLException and TimeoutException.

Check pull requests | Check pull requests   View in Datadog   GitHub Actions

 platform error Please add at least one type, and one component or instrumentation label to the pull request.

DataDog/apm-reliability/dd-trace-java | java-startup-parallel-check-slo-breaches   View in Datadog   GitLab

 platform error Failed to retrieve GitHub token. The policy 'gitlab.github-access.read' might not exist. Error: No such file or directory when accessing 'artifacts/fail-on-breach.github-token'.

View all 8 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 2e460d8 | Docs | Datadog PR Page | Give us feedback!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e62cd6dbc1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +609 to +610
if (tag.startsWith("spark-connect-session-")) {
String id = tag.substring("spark-connect-session-".length());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Recognize Spark's actual session job tag

In Spark Connect, jobs are tagged with the session tag generated by SparkSession.sessionJobTag, which is spark-session-$sessionUUID and is propagated through SQLExecution.withSessionTagsApplied; I could not find any Spark source path that emits spark-connect-session-. With this prefix check, real Connect jobs will return null here and continue to be parented under the global application span, so the new per-session trace behavior never activates outside these tests' synthetic properties.

Useful? React with 👍 / 👎.

@aboitreaud
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e460d84d3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +530 to 535
} else if (connectSessionId != null) {
AgentSpan sessionAppSpan =
getOrCreatePerSessionApplicationSpan(connectSessionId, queryStart.time(), jobProperties);
spanBuilder.asChildOf(sessionAppSpan.context());
} else if (isRunningOnDatabricks) {
addDatabricksSpecificTags(spanBuilder, jobProperties, true);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve Databricks parenting for Connect SQL jobs

When a SQL job on a Databricks cluster also carries Spark Connect spark.jobTags, this new branch wins before isRunningOnDatabricks, so addDatabricksSpecificTags(...) is skipped for the SQL span; because onJobStart then parents the job under that SQL span via the sqlSpan != null branch, neither span gets the Databricks job/run/task tags or parent context that this listener previously added for Databricks jobs. This only affects Spark Connect SQL workloads running in a Databricks environment; keep the Databricks path ahead of the Connect-session path or apply the Databricks tags/parenting as well.

Useful? React with 👍 / 👎.

Comment on lines +402 to +408
sessionAppSpan.finish(time * 1000);
}
perSessionApplicationSpans.clear();
perSessionApplicationMetrics.clear();
perSessionLastJobFailed.clear();
perSessionLastJobFailedMessage.clear();
perSessionLastJobFailedStackTrace.clear();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Flush finished session spans before the early return

For a pure Spark Connect server, the comment above notes that applicationSpan == null with jobCount > 0, so after these per-session application spans are finished the existing guard at line 410 returns before reaching the tracer.flush() used at application shutdown. Since these session root spans remain open until finishApplication, they can be left only in the writer queue when the JVM/SparkContext is stopping and may be dropped instead of being synchronously written; flush after finishing the per-session spans when this early-return path is taken.

Useful? React with 👍 / 👎.

@aboitreaud aboitreaud added inst: apache spark Apache Spark instrumentation type: enhancement Enhancements and improvements labels Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

inst: apache spark Apache Spark instrumentation type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant