[telemetry] Detect Python package manager(s) at project setup #1918

Open
rugpanov wants to merge 1 commit into main from telemetry-package-manager-detection

@rugpanov

Contributor

Changes

Measurement-only telemetry to learn which Python package manager(s) our users' projects actually use (pip / conda / uv / poetry), so the VPEX setup-flow investment can be prioritized from first-party data instead of public-survey estimates. No setup behavior changes — this is detection only.

The work splits cleanly into three layers so each is independently testable and the dependency direction stays correct (high-level → low-level):

  • Pure classifier (packageManagerDetection.ts): given a set of already-collected signals, reports every applicable manager, a best-guess primary (priority uv > poetry > conda > pip), the firing signals, hasLockfile, and interpreter source. Side-effect free and total.
  • Emit (telemetry/packageManagerExtensions.ts): adds recordPackageManagerDetection to the existing Telemetry class via the same declare module pattern as commandExtensions.ts. Keeps disk/Python-extension dependencies out of the telemetry client.
  • Collection (PackageManagerTelemetry.ts): a best-effort, non-blocking collector that reads disk and already-resolved interpreter metadata, runs the pure classifier, and calls the emit method. Deduplicated per session on (trigger, projectRoot); any failure degrades to unknown and is swallowed so it never disrupts setup.
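
For reviewers who want the classifier contract in one glance, here is a minimal sketch of a pure, total classifier with the stated priority (uv > poetry > conda > pip). The type names, signal fields, and the `"unknown"` fallback are assumptions made for illustration; the actual shapes in packageManagerDetection.ts may differ.

```typescript
// Hypothetical types: illustrative only, not the real module's exports.
type Manager = "uv" | "poetry" | "conda" | "pip";

interface Signals {
    hasUvLock: boolean;
    hasPoetryLock: boolean;
    hasToolUv: boolean; // [tool.uv] table found in pyproject.toml
    hasToolPoetry: boolean; // [tool.poetry] table found in pyproject.toml
    interpreterUnderCondaPrefix: boolean;
    hasRequirementsTxt: boolean;
}

interface Detection {
    managers: Manager[];
    primary: Manager | "unknown";
    hasLockfile: boolean;
}

// Priority order stated in the PR description.
const PRIORITY: readonly Manager[] = ["uv", "poetry", "conda", "pip"];

// Pure and total: no I/O, and every input yields a result.
function classify(s: Signals): Detection {
    const found = new Set<Manager>();
    if (s.hasUvLock || s.hasToolUv) found.add("uv");
    if (s.hasPoetryLock || s.hasToolPoetry) found.add("poetry");
    if (s.interpreterUnderCondaPrefix) found.add("conda");
    if (s.hasRequirementsTxt) found.add("pip");

    // Report every applicable manager, ordered by priority.
    const managers = PRIORITY.filter((m) => found.has(m));
    return {
        managers,
        primary: managers[0] ?? "unknown",
        hasLockfile: s.hasUvLock || s.hasPoetryLock,
    };
}
```

Because the function takes already-collected signals, overlap cases such as uv+pip can be unit tested without touching disk.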

Emission is wired into three setup touchpoints: project-open environment check (auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug).
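
The per-session deduplication and failure handling at these touchpoints could look roughly like the sketch below. `SessionDeduper`, `collectAndEmit`, and the `detect`/`emit` callbacks are hypothetical names introduced here, and emitting `"unknown"` after a failed detection is one reading of "degrades to unknown"; the collector in PackageManagerTelemetry.ts may be structured differently.

```typescript
// Trigger values as named in the PR description.
type Trigger = "auto_open" | "explicit_command" | "run/debug";

// Hypothetical helper: remembers (trigger, projectRoot) pairs for the session.
class SessionDeduper {
    private readonly seen = new Set<string>();

    // Returns true only the first time a given pair is seen.
    shouldEmit(trigger: Trigger, projectRoot: string): boolean {
        // JSON-encode the tuple so the key cannot collide across fields.
        const key = JSON.stringify([trigger, projectRoot]);
        if (this.seen.has(key)) return false;
        this.seen.add(key);
        return true;
    }
}

// Hypothetical wrapper: best-effort, never throws into the setup flow.
async function collectAndEmit(
    deduper: SessionDeduper,
    trigger: Trigger,
    projectRoot: string,
    detect: () => Promise<string>,
    emit: (primary: string) => void
): Promise<void> {
    if (!deduper.shouldEmit(trigger, projectRoot)) return;
    let primary = "unknown";
    try {
        primary = await detect();
    } catch {
        // Detection failure degrades to "unknown".
    }
    try {
        emit(primary);
    } catch {
        // Telemetry failures are swallowed so setup is never disrupted.
    }
}
```

Callers would typically not await this, for example `void collectAndEmit(...)`, so setup is never blocked on telemetry.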

A new Events.PYTHON_ENV_SETUP_DETECTED event carries a typed, documented schema (reuses the existing telemetry transport; opt-out honored; categorical data only — no paths, package names, or cluster names). A handoff note for the analytics/dashboard owner is included at src/telemetry/PACKAGE_MANAGER_DETECTION.md.

Detection correctness (the parts most worth reviewing):

  • interpreterSource is derived from the active interpreter alone, never from project files. A uv.lock project running a conda/venv/system interpreter reports that interpreter's real source, keeping the "uv project, interpreter not uv-managed yet" setup-flow gap visible. A genuinely uv-provisioned venv is identified by the uv = marker in pyvenv.cfg, not by uv.lock.
  • conda is attributed only when the active interpreter resides under CONDA_PREFIX (path-boundary checked), not on the bare env var — which is session-global in the extension host (launching VS Code from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.
  • pyproject [tool.uv]/[tool.poetry] detection uses a bounded table-header scan, not substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn), and matches subtable and array-of-table headers ([tool.uv.sources], [[tool.poetry.source]]).
  • No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved uv for a weak, non-attributing signal). Detection reads only disk and already-resolved interpreter metadata.
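
To make the table-header rule concrete, here is a minimal sketch of a header scan with the properties listed above. It is not the PR's pyprojectHasToolSection: the signature, the `maxLines` cap (one possible reading of "bounded"), and the regex are assumptions, and it does not handle quoted keys or header-like lines inside multi-line strings.

```typescript
// Illustrative sketch only: returns true if pyproject text declares a
// [tool.<tool>] table, a subtable, or an array-of-tables under it.
function pyprojectHasToolSection(text: string, tool: string, maxLines = 2000): boolean {
    const wanted = `tool.${tool}`;
    // Bound the work: look at no more than maxLines lines.
    for (const raw of text.split(/\r?\n/, maxLines)) {
        // A header line is [name] or [[name]], optionally followed by a comment.
        // Comment lines and key = "value" lines never start with "[".
        const m = /^(\[\[?)\s*([^\[\]]+?)\s*(\]\]?)\s*(#.*)?$/.exec(raw.trim());
        if (!m) continue;
        const [, open = "", name = "", close = ""] = m;
        if (open.length !== close.length) continue; // reject "[[x]" and "[x]]"
        // Normalize whitespace around dots, e.g. "tool . uv" becomes "tool.uv".
        const dotted = name.replace(/\s*\.\s*/g, ".");
        // Exact table or a dotted child; "tool.uvicorn" matches neither.
        if (dotted === wanted || dotted.startsWith(wanted + ".")) return true;
    }
    return false;
}
```

Requiring either an exact match or a `.` right after the prefix is what rejects `tool.uvicorn` while still accepting `[tool.uv.sources]`.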

Scope / privacy: measurement only — no changes to setup behavior (the VPEX flows are a separate effort). Only enum/categorical data and a closed set of signal identifiers are emitted; the existing telemetry opt-out (telemetry.telemetryLevel) is respected by the transport.

Tests

  • yarn run test:unit: 202 passing, 0 failing — includes the pure classifier (each manager, interpreter sources, overlaps like uv+pip / conda+pip / poetry+uv, weak signals, none) and pure helpers (pyprojectHasToolSection, pyvenvCfgMarksUv, interpreterUnderCondaPrefix), covering the conda-prefix boundary and shell-global false-positive cases.
  • yarn run build (typecheck) passes.
  • eslint clean; prettier formatted.

Reviewer can validate with:

cd packages/databricks-vscode
yarn run build
yarn run test:unit
npx eslint src --ext ts && npx prettier . -c

Why:
We need first-party data on which Python package manager(s) our users'
projects actually use (pip/conda/uv/poetry) to prioritize VPEX setup-flow
investment, replacing public-survey estimates. Measurement only -- no setup
behavior changes.

What:
- Add packageManagerDetection.ts: a pure, signal-based classifier that
  reports all applicable managers plus a best-guess primary (uv > poetry >
  conda > pip), the firing signals, hasLockfile, and interpreter source.
  Treats bare uv/poetry on PATH as weak signals.
- Add Events.PYTHON_ENV_SETUP_DETECTED with a typed, documented schema in
  telemetry/constants.ts (reuses existing Telemetry client; opt-out honored;
  categorical data only, no paths/package/cluster names).
- Add telemetry/packageManagerExtensions.ts: the emit half, layered onto the
  Telemetry class via the commandExtensions declare-module pattern
  (recordPackageManagerDetection). Keeps disk/Python-extension deps out of the
  Telemetry client.
- Add PackageManagerTelemetry.ts: the collection half -- a best-effort,
  non-blocking collector (disk + already-resolved interpreter metadata) that
  gathers signals, runs the pure classifier, and calls the emit method.
  Deduplicated per session on (trigger, projectRoot); failures degrade to
  unknown and are swallowed.
- Wire emission into three touchpoints: project-open env check (auto_open),
  the set-up-environment command (explicit_command), and first Run/Debug
  with Databricks Connect (run/debug).
- Add unit tests for the detector and pure helpers, and a dashboard-owner
  handoff note.

Detection correctness:
- interpreterSource is derived from the active interpreter alone, never from
  project files: a uv.lock project on a conda/venv/system interpreter reports
  that interpreter's real source, keeping the setup-flow gap visible. A
  genuinely uv-provisioned venv is identified by the `uv =` marker in
  pyvenv.cfg (pure pyvenvCfgMarksUv), not by uv.lock.
- conda is attributed only when the active interpreter resides under
  CONDA_PREFIX (pure interpreterUnderCondaPrefix, with a path-boundary check),
  not on the bare env var, which is session-global in the extension host
  (launching from an activated conda shell) and would otherwise over-count
  conda for uv/poetry/pip projects.
- pyproject [tool.uv]/[tool.poetry] detection uses a pure, bounded table-header
  scan (pyprojectHasToolSection) instead of substring matching: ignores
  comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn),
  and matches subtable and array-of-table headers (e.g. [tool.uv.sources],
  [[tool.poetry.source]]) that the substring check missed.
- No external executable is run for telemetry: the uv-on-PATH probe was
  removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal);
  detection now only reads disk and already-resolved interpreter metadata.
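
The two interpreter-side helpers can be sketched as below, as a rough illustration of the path-boundary check and the pyvenv.cfg marker test. The bodies are assumptions rather than the PR's code: the `normalize` helper is hypothetical, and the string-based normalization is a simplification that does not resolve `..` segments or symlinks, which a real implementation would more likely handle with Node's path module.

```typescript
// Hypothetical helper: unify separators and drop trailing slashes.
function normalize(p: string): string {
    return p.replace(/\\/g, "/").replace(/\/+$/, "");
}

// True only if the interpreter lives strictly inside the conda prefix.
function interpreterUnderCondaPrefix(
    interpreterPath: string,
    condaPrefix: string | undefined
): boolean {
    if (!condaPrefix) return false; // bare or missing env var attributes nothing
    const prefix = normalize(condaPrefix);
    if (prefix === "") return false;
    // Appending "/" enforces the path boundary, so /opt/conda does not
    // match /opt/conda-extra/bin/python.
    return normalize(interpreterPath).startsWith(prefix + "/");
}

// True if pyvenv.cfg text contains a `uv = ...` key line.
function pyvenvCfgMarksUv(cfgText: string): boolean {
    return cfgText.split(/\r?\n/).some((line) => /^\s*uv\s*=/.test(line));
}
```

Anchoring the key match at the start of the line keeps a value that merely mentions uv, such as `home = /opt/uv/bin`, from counting as the marker.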

Verification:
- yarn run build (typecheck) passes.
- eslint clean; prettier formatted.
- yarn run test:unit: 228 passing, 0 failing (includes detector + helper tests).

Co-authored-by: Isaac
@rugpanov rugpanov force-pushed the telemetry-package-manager-detection branch from c236f29 to be9f174 Compare June 19, 2026 09:26
@rugpanov rugpanov temporarily deployed to test-trigger-is June 19, 2026 09:26 — with GitHub Actions Inactive
@github-actions

Contributor

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/vscode

Inputs:

  • PR number: 1918
  • Commit SHA: be9f1746227ce20e8261ed183a86775c0b99da9e

Checks will be approved automatically on success.
