248 changes: 248 additions & 0 deletions docs/telemetry.md
@@ -0,0 +1,248 @@
# Ops Telemetry

dws can emit one **anonymous, dimensions-only** ops metric per **command
invocation**, used to monitor error rate, latency, command distribution, and
version/platform health. It is the ops-side counterpart of [audit](./audit.md),
but deliberately **far smaller**:

- Collects **coarse dimensions only** — never object names, free text, peer ids,
device fingerprints, or natural-language input. There is no "redaction profile"
because there are no sensitive fields to redact in the first place.
- **Independent of audit**: unrelated to `DWS_AUDIT_*`; you can enable telemetry
without enabling compliance audit.
- **Default posture depends on the build** (see [Default posture](#default-posture)):
the open-source build is **off** (pure opt-in); a downstream distribution may
bake in a default endpoint and ship **on by default**, with a one-time
disclosure and an opt-out.

> This is an open-source CLI: the **public build reports nothing unless an
> operator opts in, and it never hardcodes an endpoint**. Any on-by-default
> behavior lives only in a downstream build that injects its own endpoint, and
> even then it is disclosed once and can be opted out of.

## Enabling

| Environment variable | Description | Example |
|---|---|---|
| `DWS_TELEMETRY_ENABLED` | Explicitly enable/disable; overrides the build default either way | `true` / `false` |
| `DWS_TELEMETRY_DISABLED` | Hard opt-out; wins over everything (the off switch for on-by-default builds) | `true` |
| `DWS_TELEMETRY_FILE` | **Local file sink** — append each event as one JSON line here instead of POSTing (no server, no network). Takes precedence over URL | `~/.dws/telemetry.jsonl` |
| `DWS_TELEMETRY_URL` | Ingest endpoint; overrides the build-time default; one JSON event POSTed per invocation | `https://telemetry.example.com/dws` |
| `DWS_TELEMETRY_TOKEN` | Bearer auth for the endpoint (optional) | `xxxxx` |
| `DWS_TELEMETRY_TIMEOUT_MS` | Per-report timeout cap, in ms (default 1500) | `1500` |
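
The precedence in this table can be sketched as a small resolver. This is only an illustration in Python, not the dws implementation (the CLI itself is built with Go); the function names and the fallback to the default on an invalid timeout are assumptions.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_TIMEOUT_MS = 1500  # documented default for DWS_TELEMETRY_TIMEOUT_MS


def resolve_sink(env: Mapping[str, str], baked_in_url: str = "") -> Optional[Tuple[str, str]]:
    """Return ("file", path), ("url", endpoint), or None when no destination exists."""
    file_path = env.get("DWS_TELEMETRY_FILE", "").strip()
    if file_path:
        # The local file sink takes precedence over any URL.
        return ("file", os.path.expanduser(file_path))
    # The env URL overrides the build-time default endpoint.
    url = env.get("DWS_TELEMETRY_URL", "").strip() or baked_in_url
    if url:
        return ("url", url)
    return None


def resolve_timeout_ms(env: Mapping[str, str]) -> int:
    """Parse the timeout; fall back to the default on missing or invalid input (assumed)."""
    raw = env.get("DWS_TELEMETRY_TIMEOUT_MS", "").strip()
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_MS
    return value if value > 0 else DEFAULT_TIMEOUT_MS
```

Passing the environment as a mapping (for example `resolve_sink(os.environ)`) keeps the sketch easy to test without touching the real process environment.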

## Default posture

`Enabled()` resolves like this:

1. `DWS_TELEMETRY_DISABLED=true` → **off** (always wins).
2. No destination (no `DWS_TELEMETRY_URL` and no baked-in default) → **off**.
3. `DWS_TELEMETRY_ENABLED` set → its value wins (`true`/`false`).
4. Otherwise → **on only if the build baked in a default endpoint**; a bare env
URL in the open-source build stays opt-in (off until `DWS_TELEMETRY_ENABLED=true`).
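
The four rules above can be modeled as one function. This is a Python sketch of the documented order, not the actual `Enabled()` code; the accepted spellings of "true" are an assumption, and the file sink (`DWS_TELEMETRY_FILE`) is deliberately left out because the list does not say where it fits.

```python
from typing import Mapping

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed spellings; the doc only shows true/false


def _is_true(value: str) -> bool:
    return value.strip().lower() in TRUE_VALUES


def telemetry_enabled(env: Mapping[str, str], baked_in_url: str = "") -> bool:
    """Model of the documented resolution order (file sink not modeled)."""
    # 1. The hard opt-out always wins.
    if _is_true(env.get("DWS_TELEMETRY_DISABLED", "")):
        return False
    # 2. No destination means there is nowhere to report to.
    if not env.get("DWS_TELEMETRY_URL", "").strip() and not baked_in_url:
        return False
    # 3. An explicit DWS_TELEMETRY_ENABLED decides either way.
    explicit = env.get("DWS_TELEMETRY_ENABLED", "").strip()
    if explicit:
        return _is_true(explicit)
    # 4. Otherwise on only when the build baked in a default endpoint.
    return bool(baked_in_url)
```

With these rules a bare `DWS_TELEMETRY_URL` in the open-source build yields `False`, while the same call with a non-empty `baked_in_url` yields `True` until the user sets `DWS_TELEMETRY_DISABLED=true`.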

**Open-source build** → off; an operator opts in with `DWS_TELEMETRY_ENABLED=true`
plus a `DWS_TELEMETRY_URL`.

**Downstream "fleet" build (on by default)** → inject a default endpoint at build
time via `-ldflags`, so every install of that distribution reports to the
operator's own ingest out of the box (users opt out with
`DWS_TELEMETRY_DISABLED=true`):

```bash
go build -ldflags "\
-X github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/telemetry.defaultURL=https://<your-fc-host>/dws \
-X github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/telemetry.defaultToken=<token>" ./cmd
```

The public repo never hardcodes a real endpoint — only your build does. This
keeps "code is open source" and "data lands in the operator's own sink"
decoupled.

### One-time disclosure

The first time telemetry is active on a machine, dws prints a one-time notice to
stderr and writes a marker (`~/.dws/.telemetry_notice_shown`) so it never repeats:

```
ℹ️ dws reports anonymous operational telemetry (command, outcome, latency, version
— no content, no identity) to help monitor stability. Opt out anytime with
DWS_TELEMETRY_DISABLED=true. Details: docs/telemetry.md
```
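
A marker-file check like the one described here might look as follows. This is a hedged Python sketch, not the dws code: the function name, the shortened notice text, and the choice to swallow filesystem errors are all assumptions.

```python
import sys
from pathlib import Path
from typing import TextIO

NOTICE = (
    "dws reports anonymous operational telemetry (command, outcome, latency, version; "
    "no content, no identity). Opt out anytime with DWS_TELEMETRY_DISABLED=true."
)


def show_notice_once(marker: Path, stream: TextIO = sys.stderr) -> bool:
    """Print the notice and create the marker; return True only on the first call."""
    if marker.exists():
        return False
    print(NOTICE, file=stream)
    try:
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
    except OSError:
        # Assumed policy: a read-only home directory must not break the command itself.
        pass
    return True
```

A caller would pass the documented marker path, for example `show_notice_once(Path.home() / ".dws" / ".telemetry_notice_shown")`.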

## Local monitoring (lightest — no server, no SLS)

The smallest possible setup: point telemetry at a **local file**. No receiver, no
FC, no SLS: each machine appends its own events, and you aggregate the file
whenever you need a report.

```bash
# turn it on (file sink alone enables telemetry)
export DWS_TELEMETRY_FILE=~/.dws/telemetry.jsonl

# ... use dws normally ...

# one-line stability view (per command: calls / errors / avg latency)
python3 - <<'PY'
import collections, json, os

path = os.path.expanduser('~/.dws/telemetry.jsonl')
with open(path) as fh:
    rows = [json.loads(line) for line in fh if line.strip()]

# Group by "command.subcommand" and accumulate calls, errors, and latencies.
by = collections.defaultdict(lambda: {'n': 0, 'err': 0, 'dur': []})
for r in rows:
    b = by[f"{r.get('command')}.{r.get('subcommand')}"]
    b['n'] += 1
    b['err'] += r.get('outcome') != 'ok'
    b['dur'].append(r.get('duration_ms', 0))

print(f"{'command':<28}{'calls':>6}{'err':>5}{'avg_ms':>8}")
for k, v in sorted(by.items(), key=lambda x: -x[1]['n']):
    d = v['dur'] or [0]
    print(f"{k:<28}{v['n']:>6}{v['err']:>5}{sum(d) // len(d):>8}")
PY
```

For a small fleet, collect each machine's `telemetry.jsonl` (rsync/scp) and run
the same aggregation over the combined files. Scale to the URL→ingest path only
when you outgrow this.
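
For the combined-files case, a loader that tolerates blank or truncated lines can help, since a file copied while dws is appending may end mid-record. A minimal Python sketch, with hypothetical function names:

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_events(paths: Iterable[Path]) -> Iterator[Dict]:
    """Yield events from several JSONL files, skipping blank or truncated lines."""
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # e.g. a partially written last line copied mid-append
                if isinstance(event, dict):
                    yield event


def error_rate(events: Iterable[Dict]) -> float:
    """Share of events whose outcome is not "ok"; 0.0 when there are no events."""
    total = errors = 0
    for event in events:
        total += 1
        errors += event.get("outcome") != "ok"
    return errors / total if total else 0.0
```

Assuming the collected files sit under a hypothetical `collected/<host>/telemetry.jsonl` layout, `error_rate(iter_events(Path("collected").glob("*/telemetry.jsonl")))` gives a fleet-wide figure.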

## Reported fields (complete)

```jsonc
{
  "schema_version": "1",
  "ts": "2026-06-04T11:38:24+08:00",
  "trace_id": "76a04f9eba0ad00c",   // == transport execution_id, joinable with server-side logs
  "corp_id": "ding...",             // tenant dimension, best-effort (from the login token)
  "cli_version": "1.0.34",          // version health: "did this release break a command"
  "channel": "openclaw",            // which agent/integration drove the call (DWS_CHANNEL)
  "os": "darwin",                   // coarse platform, not PII
  "module": "doc",
  "command": "doc",
  "subcommand": "create_document",
  "outcome": "ok",                  // ok | error
  "err_class": "",                  // error category when outcome=error
  "exit_code": 0,
  "duration_ms": 73                 // wall-clock latency of the call, used for P99
}
```

**Deliberately not collected** (verify the privacy boundary by reading the
struct): user identity (user_id / name), object names/ids, free text, device
id/serial, request/response body.
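
One way to check this boundary from the outside is an allowlist built from the sample event above. A small Python sketch; the helper name is hypothetical, and the field set is copied from the example, not from the Go struct.

```python
from typing import Dict, Set

# Field names copied from the sample event in this section.
ALLOWED_FIELDS: Set[str] = {
    "schema_version", "ts", "trace_id", "corp_id", "cli_version", "channel", "os",
    "module", "command", "subcommand", "outcome", "err_class", "exit_code", "duration_ms",
}


def unexpected_fields(event: Dict) -> Set[str]:
    """Return any top-level keys that are not part of the documented schema."""
    return set(event) - ALLOWED_FIELDS
```

Running this over captured events and expecting an empty set every time is a cheap regression check for accidental new fields.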

## Receiver contract

Any HTTP service can receive it:

```
POST /
Content-Type: application/json
Authorization: Bearer <token> # matches DWS_TELEMETRY_TOKEN
X-Dws-Telemetry-Schema: 1
Body: one telemetry event JSON
Return 2xx for success
```
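
The contract is small enough to sketch with the Python standard library. This is not `localsink.py`, only an illustration; returning 204 on success and 400 on a malformed body are assumptions (the contract only asks for a 2xx on success), and `EXPECTED_TOKEN` is a placeholder.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler
from typing import Mapping, Tuple

EXPECTED_TOKEN = "dev"  # placeholder; load it from the environment in real use


def check_request(headers: Mapping[str, str], body: bytes, token: str) -> Tuple[int, dict]:
    """Return (HTTP status, parsed event) for one POST under the contract above."""
    supplied = headers.get("Authorization", "")
    if not hmac.compare_digest(supplied.encode(), f"Bearer {token}".encode()):
        return 401, {}
    try:
        event = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400, {}
    if not isinstance(event, dict):
        return 400, {}
    return 204, event


class Handler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, event = check_request(self.headers, self.rfile.read(length), EXPECTED_TOKEN)
        if status == 204:
            print(json.dumps(event))  # one event per line on stdout
        self.send_response(status)
        self.end_headers()
```

To try it, import `HTTPServer` from `http.server` and run `HTTPServer(("127.0.0.1", 8799), Handler).serve_forever()`. Keeping the decision logic in `check_request` makes it testable without opening a socket.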

## Local testing (zero dependencies, no SLS)

Before going to SLS, run the whole pipeline locally. Use
`fc-sls-ingest/localsink.py` (pure Python standard library, no `pip install`
needed) as the receiver:

```bash
# 1. Start the local receiver (with a test token)
cd docs/telemetry/fc-sls-ingest
TOKEN=dev python3 localsink.py # listens on 127.0.0.1:8799, writes /tmp/dws_telemetry.jsonl

# 2. In another terminal, point dws at it
export DWS_TELEMETRY_ENABLED=true
export DWS_TELEMETRY_URL=http://127.0.0.1:8799
export DWS_TELEMETRY_TOKEN=dev

# 3. Run a few commands (--mock needs no network or real backend, still emits telemetry)
dws doc create --title test --mock
dws drive list --mock
```

The receiver prints each event in real time and appends to
`/tmp/dws_telemetry.jsonl`. Things to verify:

- Events carry dimensions such as `command/outcome/duration_ms/cli_version/channel/os`;
- Compare command arguments (e.g. `--title test`) against the payload and confirm
the **content does not appear in the payload**;
- A POST without the token must be rejected (401).
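
The second check can be automated by searching the captured events for the argument values you passed. A minimal Python sketch with a hypothetical helper name; use a distinctive probe value instead of a common word such as `test`, which could match a legitimate dimension.

```python
import json
from typing import Iterable, List


def find_leaks(jsonl_lines: Iterable[str], argument_values: Iterable[str]) -> List[str]:
    """Return the argument values that appear anywhere in the serialized events."""
    needles = [v for v in argument_values if v]
    leaked = set()
    for line in jsonl_lines:
        if not line.strip():
            continue
        # Re-serialize so the search covers keys and values in one normalized string.
        text = json.dumps(json.loads(line), ensure_ascii=False)
        leaked.update(v for v in needles if v in text)
    return sorted(leaked)
```

For example, run a command with `--title zz-leak-probe`, then call `find_leaks(open('/tmp/dws_telemetry.jsonl'), ['zz-leak-probe'])` and expect an empty list.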

Once written to disk, you can locally simulate the kind of metrics a dashboard
would compute:

```bash
python3 - <<'PY'
import collections, json

with open('/tmp/dws_telemetry.jsonl') as fh:
    rows = [json.loads(line) for line in fh if line.strip()]

# Group by "command subcommand" and accumulate calls, errors, and latencies.
by = collections.defaultdict(lambda: {'n': 0, 'err': 0, 'dur': []})
for r in rows:
    b = by[f"{r['command']} {r['subcommand']}"]
    b['n'] += 1
    b['err'] += r['outcome'] != 'ok'
    b['dur'].append(r.get('duration_ms', 0))

for k, v in sorted(by.items(), key=lambda x: -x[1]['n']):
    d = v['dur']
    print(f"{k:<26}calls{v['n']:>4} err{v['err']:>3} avg{sum(d) // len(d):>5}ms max{max(d):>5}ms")
PY
```

> Note: telemetry is only emitted once a command actually reaches the MCP-call
> stage. If a command fails at argument parsing (before the call), no telemetry is
> produced — this is expected behavior.

## Boundary between open-source code and internal resources (public/private split)

dws is an open-source repository, but **which SLS the telemetry lands in and which
internal app it binds to is the deployer's own concern and never goes into the
repo**. This boundary is by design, not accident:

| | Where | Contains | In repo? |
|---|---|---|---|
| dws binary + the FC/local reference code in this dir | Public repo | Only POSTs to `DWS_TELEMETRY_URL`; **no endpoint, no secret, no app name** | ✅ |
| SLS Project / FC instance / real URL+token | Deployer's internal infra | Real address, auth, logstore; inside Alibaba it also binds to an internal app | ❌ Never in the repo; injected via env vars or build-time `-ldflags` |

The public code **never hardcodes any vendor reporting address**; the URL comes
from an environment variable at runtime, or from a default that a downstream
build injects via `-ldflags`. So "code is public" and "data lands in the
deployer's internal SLS" are naturally decoupled: switching deployers only means
a different set of env vars (or build flags), the repo needs no change, and no
party's real config is visible.

> Inside Alibaba: the SLS Project must be attached to an AONE app (a
> resource-governance requirement). Bind it to the app that owns the dws backend
> (e.g. the DingTalk MCP gateway app). That binding, the real URL, and the token
> all stay internal; the public repo is unaware of them.

## Wiring up Alibaba Cloud SLS (recommended for production)

SLS (Log Service) ships with ingest / storage / search / dashboards / alerting —
a standard choice for ops monitoring:

1. **Create the store**: in the SLS console create a Project + Logstore (e.g.
`dws-telemetry`), set retention days; index the fields `command` /
`subcommand` / `outcome` / `cli_version` / `corp_id` / `channel`, and set
`duration_ms` as a long-typed index (needed for P99).
2. **Create the receiver endpoint**: a **Function Compute (FC)** HTTP trigger is
the lowest-ops option — after validating the Bearer, write the body as a single
log via `PutLogs` into the Logstore (put the whole JSON in an `event` field and
also extract `command`/`outcome`/`duration_ms`/`cli_version` as indexed
columns).
3. **Roll out**: on every machine where dws runs, set `DWS_TELEMETRY_URL` to the
FC address, together with `DWS_TELEMETRY_ENABLED=true` and the matching
`DWS_TELEMETRY_TOKEN`.
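
The "whole JSON in an `event` field plus extracted columns" shape from step 2 could be produced like this. A Python sketch only: the function name is hypothetical, the list of string key/value pairs is an assumed intermediate form, and no SLS SDK call is made here.

```python
import json
from typing import Dict, List, Tuple

# Columns to extract for indexing, taken from steps 1 and 2 above.
INDEXED_COLUMNS = (
    "command", "subcommand", "outcome", "cli_version", "corp_id", "channel", "duration_ms",
)


def to_log_contents(event: Dict) -> List[Tuple[str, str]]:
    """Flatten one telemetry event into string key/value pairs for a single log entry."""
    contents = [("event", json.dumps(event, separators=(",", ":"), sort_keys=True))]
    for key in INDEXED_COLUMNS:
        if key in event:
            # Values are stringified here; the long type comes from the Logstore index config.
            contents.append((key, str(event[key])))
    return contents
```

Keeping the raw event alongside the extracted columns means a later schema addition is still searchable in `event` before its index exists.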

### Four ready-to-use alerts (SLS alert rules)

| Alert | SLS query (illustrative) | Trigger |
|---|---|---|
| Error-rate spike | `* \| select count_if(outcome='error')*1.0/count(*) as err_rate` | err_rate > 0.05 (5%) |
| P99 latency over budget | `* \| select approx_percentile(duration_ms, 0.99) as p99` | p99 > 3000 |
| One command failing broadly | `* \| select command, count_if(outcome='error') c group by command order by c desc` | c spikes for a single command |
| Call volume drops to zero | `* \| select count(*) as n` | n == 0 over a 5-minute window |

The alert notification channel can be a DingTalk bot directly.
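
Before creating the rules in SLS, the same thresholds can be tried against a local JSONL capture. This Python sketch mirrors three of the alerts; the function names and alert labels are hypothetical, and the percentile uses the exact nearest-rank method, whereas SLS `approx_percentile` is approximate.

```python
from typing import Dict, List


def err_rate(events: List[Dict]) -> float:
    """Fraction of events with outcome == "error"; 0.0 for an empty window."""
    if not events:
        return 0.0
    return sum(e.get("outcome") == "error" for e in events) / len(events)


def p99_ms(events: List[Dict]) -> int:
    """Nearest-rank 99th percentile of duration_ms; 0 for an empty window."""
    durations = sorted(e.get("duration_ms", 0) for e in events)
    if not durations:
        return 0
    rank = -(-99 * len(durations) // 100)  # integer ceil(0.99 * n)
    return durations[rank - 1]


def firing_alerts(events: List[Dict], max_err_rate: float = 0.05, max_p99: int = 3000) -> List[str]:
    """Evaluate one time window of events against the table's thresholds."""
    if not events:
        return ["call_volume_zero"]
    fired = []
    if err_rate(events) > max_err_rate:
        fired.append("error_rate_spike")
    if p99_ms(events) > max_p99:
        fired.append("p99_over_budget")
    return fired
```

The caller is expected to pass only the events of the window under test (for example the last 5 minutes), since the sketch does no time filtering itself.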

## Where the data lands / two flows

- **Off = never leaves the machine.** dws ships no default vendor reporting address.
- **Enterprise self-hosted monitoring**: point `DWS_TELEMETRY_URL` at the
enterprise's own SLS ingest.
- **Platform-side unified monitoring**: point the URL at DingTalk's telemetry
ingest — technically possible, but must be opt-in + disclosed. Because this
telemetry **contains only anonymous dimensions**, the privacy boundary is clean
by construction, suitable for a platform ops dashboard.
- Full compliance trails are a separate track: send them to the enterprise's own
sink via [audit](./audit.md), and don't mix them with telemetry.
6 changes: 6 additions & 0 deletions docs/telemetry/fc-sls-ingest/.dockerignore
@@ -0,0 +1,6 @@
localsink.py
README.md
s.yaml
deploy.sh
Dockerfile
.dockerignore
22 changes: 22 additions & 0 deletions docs/telemetry/fc-sls-ingest/Dockerfile
@@ -0,0 +1,22 @@
# dws telemetry ingest — container image (AONE-deployable / any container platform).
#
# Build: docker build -t dws-telemetry-ingest .
# Run: docker run -p 9000:9000 -e INGEST_TOKEN=<token> dws-telemetry-ingest # dry-run
# docker run -p 9000:9000 -e INGEST_TOKEN=<token> \
# -e SLS_ENDPOINT=... -e SLS_PROJECT=... -e SLS_LOGSTORE=... dws-telemetry-ingest # -> SLS
#
# For AONE: point the app's build at this Dockerfile; expose port 9000; set the
# env vars (INGEST_TOKEN required; SLS_* to write to an internal SLS Logstore).
# Grant the running identity SLS PutLogs so app.py uses injected creds.
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .

ENV PORT=9000
EXPOSE 9000

# gunicorn web server; app.py auto-detects dry-run vs SLS mode from env.
CMD ["gunicorn", "-b", "0.0.0.0:9000", "--workers", "2", "--timeout", "30", "app:app"]
118 changes: 118 additions & 0 deletions docs/telemetry/fc-sls-ingest/README.md
@@ -0,0 +1,118 @@
# dws Telemetry Receiver (Function Compute FC → SLS)

This is the **reference receiver** for [Ops Telemetry](../../telemetry.md): dws
POSTs one telemetry JSON over, but SLS cannot accept a raw POST directly (writes
must be signed), so this minimal HTTP service sits in between — it validates the
token and writes to SLS via `PutLogs`. Deploy it as a Function Compute (FC) **web
function**; you don't have to deal with the FC handler signature.

```
dws ──POST one JSON──▶ this service (FC web fn) ──PutLogs──▶ SLS Logstore ──▶ dashboard/alerts
```

## Files

- `app.py` — Flask service: `POST /` validates Bearer → parses JSON → writes SLS; `GET /` health check
- `requirements.txt` — dependencies (flask / gunicorn / aliyun-log-python-sdk)

## 1. Create the store in SLS first (a few clicks in the console)

1. Create a **Project** (e.g. `dws-ops`) and a **Logstore** (e.g.
`dws-telemetry`), set retention days.
2. Enable indexes: set `command` / `subcommand` / `outcome` / `cli_version` /
`corp_id` / `channel` as **text**; set `duration_ms` / `exit_code` as **long**
(needed for P99 and aggregation).

## Two run modes (auto-detected)

`app.py` switches automatically by environment variables — **no code change**:

| Mode | Trigger | Behavior |
|---|---|---|
| **dry-run** | Any SLS variable missing, or `TELEMETRY_DRYRUN=true` | Received events are printed to stdout (FC captures this in function logs) and return 204. **Does not require the aliyun-log SDK** — good for validating the pipeline first |
| **SLS** | `SLS_ENDPOINT`+`SLS_PROJECT`+`SLS_LOGSTORE` all set | After validation, `PutLogs` writes into the Logstore |

The `GET /` health check echoes the current mode (`mode=dry-run` / `mode=sls`),
so it's obvious right after deploy.
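
The mode switch in the table can be expressed in a few lines. This is a Python sketch of the documented rule, not a copy of `app.py`; the function name is hypothetical, and treating only the literal `true` as a forced dry-run is an assumption.

```python
from typing import Mapping

SLS_VARS = ("SLS_ENDPOINT", "SLS_PROJECT", "SLS_LOGSTORE")


def detect_mode(env: Mapping[str, str]) -> str:
    """Return "sls" only when all SLS variables are set and dry-run is not forced."""
    forced = env.get("TELEMETRY_DRYRUN", "").strip().lower() == "true"
    has_sls = all(env.get(name, "").strip() for name in SLS_VARS)
    return "sls" if has_sls and not forced else "dry-run"
```

Defaulting to dry-run whenever anything is missing keeps a half-configured deploy from failing on its first write.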

## 2. Deploy this service as an FC web function

1. Function Compute console → Create function → **Web function** → Python runtime.
2. Upload this directory's code (incl. `requirements.txt`; FC installs deps
automatically).
3. **Startup command**: `gunicorn -b 0.0.0.0:9000 app:app`, **listen port** `9000`.
4. **Dry-run validation first (strongly recommended)**: on the first deploy set
only `INGEST_TOKEN` and **leave the SLS variables unset** (or add
`TELEMETRY_DRYRUN=true`). After deploy, `GET /` should show `mode=dry-run`;
point dws at it, run a few commands, and you'll see `DRYRUN {...}` lines in FC's
**function logs** — proving the "client → FC" leg works. This step **needs no
SLS, no store, no SDK**.
5. **Then wire up SLS**: **bind a service role** to the function and grant
`AliyunLogFullAccess` (or a narrower PutLogs permission) — this way you don't
put an AccessKey in env vars; FC injects STS temporary credentials and `app.py`
reads them first. Then add the SLS env vars; once `GET /` becomes `mode=sls`
it's live:

| Variable | Value | Note |
|---|---|---|
| `SLS_ENDPOINT` | `cn-hangzhou.log.aliyuncs.com` | change to your region |
| `SLS_PROJECT` | `dws-ops` | the Project from step 1 |
| `SLS_LOGSTORE` | `dws-telemetry` | the Logstore from step 1 |
| `INGEST_TOKEN` | a random string you generate | must match dws-side `DWS_TELEMETRY_TOKEN` |

6. After deploy, grab the function's HTTP trigger address (like
`https://xxx.cn-hangzhou.fcapp.run`).

## 3. Wire dws up

In the environment where dws runs (or injected by the host agent):

```bash
export DWS_TELEMETRY_ENABLED=true
export DWS_TELEMETRY_URL="https://xxx.cn-hangzhou.fcapp.run" # the function address from above
export DWS_TELEMETRY_TOKEN="<same random string as INGEST_TOKEN>"
```

Run a few commands and you'll see records appear in the SLS Logstore query page.

## 4. Validate locally first (optional, no FC / SLS needed)

The simplest local validation uses `localsink.py` (pure standard library, zero
deps), see [the "Local testing" section in telemetry.md](../../telemetry.md#local-testing-zero-dependencies-no-sls).

You can also run this service's **dry-run mode** locally (no SLS, no aliyun-log):

```bash
cd docs/telemetry/fc-sls-ingest
pip install flask # dry-run only needs flask; aliyun-log is only for SLS mode
INGEST_TOKEN=dev python3 app.py # no SLS_* -> auto dry-run, listens on :9000
# in another terminal:
curl -s localhost:9000/ # should echo mode=dry-run
curl -XPOST localhost:9000/ -H 'Authorization: Bearer dev' \
-H 'Content-Type: application/json' \
-d '{"schema_version":"1","command":"doc","outcome":"ok","duration_ms":42}'
# returns 204; the event prints as DRYRUN {...} in the app.py terminal.
```

To validate against real SLS locally, add `SLS_ENDPOINT/SLS_PROJECT/SLS_LOGSTORE`
and an AccessKey (`pip install -r requirements.txt` to install aliyun-log), and
`GET /` will become `mode=sls`.

## 5. Configure alerts (SLS console → Alerts)

| Alert | Query (illustrative) | Trigger |
|---|---|---|
| Error-rate spike | `* \| select count_if(outcome='error')*1.0/count(*) as err_rate` | err_rate > 0.05 |
| P99 latency over budget | `* \| select approx_percentile(duration_ms, 0.99) as p99` | p99 > 3000 |
| One command failing broadly | `* \| select command, count_if(outcome='error') c group by command order by c desc` | c spikes for a single command |
| Call volume drops to zero | `* \| select count(*) as n` | n == 0 (5-minute window) |

The notification channel can be a DingTalk bot webhook directly.

## Security notes

- Use a strong random string for `INGEST_TOKEN`, keep it in sync with the dws
side, and never leave it empty.
- Prefer the FC service role (STS); do not put a long-lived AccessKey in env vars.
- This service only accepts **anonymous dimension** data — no user content or
identity; the privacy boundary is guaranteed by the dws client.
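
For the first note, a token can be generated and checked with the Python standard library. A hedged sketch: the helper names are hypothetical and this is not necessarily how `app.py` validates the header.

```python
import hmac
import secrets


def new_ingest_token(num_bytes: int = 32) -> str:
    """Generate a URL-safe random string suitable for INGEST_TOKEN."""
    return secrets.token_urlsafe(num_bytes)


def bearer_matches(authorization_header: str, expected_token: str) -> bool:
    """Compare the Authorization header against the expected Bearer token."""
    if not expected_token:
        return False  # an empty INGEST_TOKEN must never authorize anything
    supplied = authorization_header.encode("utf-8")
    expected = f"Bearer {expected_token}".encode("utf-8")
    # compare_digest avoids leaking how many leading bytes matched via timing.
    return hmac.compare_digest(supplied, expected)
```

Set the generated value as `INGEST_TOKEN` on the receiver and as `DWS_TELEMETRY_TOKEN` on the dws side.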