From 023f97735e208740a1ec251f36c576c72875480f Mon Sep 17 00:00:00 2001 From: Dale McDiarmid Date: Fri, 12 Jun 2026 19:44:52 +0100 Subject: [PATCH 1/4] update skill --- .../skills/clickstack-otel-collector/SKILL.md | 466 ++++++++++++------ 1 file changed, 314 insertions(+), 152 deletions(-) diff --git a/static/skills/clickstack-otel-collector/SKILL.md b/static/skills/clickstack-otel-collector/SKILL.md index 2781a8b14ed..10ec87dd0df 100644 --- a/static/skills/clickstack-otel-collector/SKILL.md +++ b/static/skills/clickstack-otel-collector/SKILL.md @@ -1,181 +1,293 @@ --- name: clickstack-otel-collector -description: Use when a user wants to set up the OpenTelemetry collector for Managed ClickStack on a ClickHouse Cloud service, send logs/traces/metrics, and verify the data lands in ClickStack. +description: Use when a user wants to stand up a local Docker OpenTelemetry collector for Managed ClickStack on a ClickHouse Cloud service, send logs/traces/metrics, and verify the data is visible in ClickStack. license: Apache-2.0 metadata: author: ClickHouse Inc - version: "0.1.0" + version: "0.2.0" --- -# Set up the ClickStack OpenTelemetry collector +# Set up the ClickStack OpenTelemetry collector (local Docker) -This skill walks an agent through wiring an OpenTelemetry collector into a Managed ClickStack service running on ClickHouse Cloud. It uses [`clickhousectl`](https://clickhouse.com/docs/interfaces/cli) for all cloud and SQL operations. +This skill wires a **local Docker** OpenTelemetry collector into a Managed ClickStack +service running on ClickHouse Cloud, sends synthetic telemetry through it, and confirms +the data is actually visible in ClickStack. It uses +[`clickhousectl`](https://clickhouse.com/docs/interfaces/cli) for all cloud and SQL +operations. + +**Scope.** This is deliberately the *local Docker collector* path. 
It is one of several +things a user might want: they may already run a collector and only need exporter config, +or they may want a Kubernetes deployment with secrets in K8s Secrets. Those are out of +scope here. If the user clearly wants one of those instead, say so and stop, rather than +building a local container they did not ask for. The end state is: -- A dedicated `hyperdx_ingest` SQL user on the target service. -- A ClickStack-distribution OpenTelemetry collector running locally, accepting OTLP on `4317`/`4318` and writing into the `otel` database on the target service. -- Synthetic telemetry exercising the pipeline. -- A URL the user can open to view the data in ClickStack. +- A dedicated `hyperdx_ingest` SQL user on the target service, with exactly the grants + this collector image needs (`otel.*`, plus `default.*` only on older image builds; see Step 4). +- A ClickStack-distribution OpenTelemetry collector running locally as a Docker container, + accepting OTLP on `4317`/`4318`, exposing a health endpoint on `13133`, and writing into + the `otel` database on the target service. +- Synthetic telemetry exercising the logs, traces, and metrics pipelines. +- The service confirmed **awake**, and the user pointed to the ClickHouse Cloud console to + complete ClickStack onboarding (Getting Started, auto-detect sources) so they can actually + *see* their data. -Follow these steps in order. Do not skip ahead — each step depends on state established by the previous one. +Secrets (the OTLP auth token and the SQL password) are generated locally, written **once** +to a `0600` env file, and passed to Docker via `--env-file`. They are never pasted into the +chat, never passed with `docker run -e`, and never echoed back after creation. ---- +Follow these steps in order. Each step depends on state established by the previous one. --- -## Before you begin: heads-up on permissions +--- -Coding agents typically ask the user to approve each shell command the first time they see it. 
To keep this run smooth, tell the user up front what categories of commands you'll need to run, and ask them to grant the permissions in one batch — "allow always" / "approve for session" in their agent — for each category below. That way you won't have to interrupt them every few steps. +## Step 0: Batch the permissions up front -You will run, in order: +Coding agents prompt for approval the first time they see each shell command. To avoid +interrupting the user every few steps, ask them once, up front, to allowlist these command +prefixes (the "always allow for this project / session" option in their agent). There are no +destructive operations and nothing targets anything outside this project or their ClickHouse +Cloud service: -1. **`openssl rand …`** — to generate a random OTLP auth token and a SQL password. -2. **`clickhousectl …`** — to authenticate, look up the service, and run SQL via the Query API. Specifically: `cloud auth status`, `cloud auth login`, `cloud service get`, `cloud service query`. -3. **`docker info`, `docker run`, `docker ps`, `docker logs`** — to deploy and observe the OpenTelemetry collector container. -4. **`otelgen …`** — to send synthetic logs, traces, and metrics through the collector. -5. **`jq …`** and **`curl …`** — small text/JSON utilities and one local HTTP healthcheck. 
+| Command prefix | Used for | +| --- | --- | +| `openssl rand …` | generate the OTLP token and SQL password | +| `clickhousectl cloud …` | auth status, resolve the service, run SQL via the Query API | +| `docker …` | run/inspect the collector and the telemetry generator | +| `curl …` | one local health check against `localhost:13133` | +| `jq …` | parse JSON from `clickhousectl` | -Tell the user: *"If your agent supports it, choose 'always allow' for each command class above the first time it asks — there are no destructive operations and no commands targeting anything outside this project / your ClickHouse Cloud service."* +Tell the user, in your own words: *"If your agent supports it, choose 'always allow' for +each of these the first time it asks. On your machine, the run only creates a working directory, +a Docker network, and one Docker container; write operations against ClickHouse are limited to +creating the ingest user and the `otel` schema."* Then continue. --- -## Step 1: Confirm the target service +## Step 1: Confirm the target service and lay down the secrets file -The user's prompt contains a service identifier — either a service ID (UUID) or a service name. Treat that value as `SERVICE_REF`. +The user's prompt contains a service identifier, either a service ID (UUID) or a service +name. Treat that value as `SERVICE_REF`. -**Ask the user to confirm**, and capture three values in your working memory: +Create a working directory and a **`0600` env file** that will hold all configuration and +secrets for this run. The key names match exactly what the collector image reads, so this +same file is passed straight to `docker run --env-file` in Step 5. Write it under a tight +`umask` so the secret is never briefly world-readable: -1. `SERVICE_REF` — what they gave you. -2. `OTLP_AUTH_TOKEN` — a shared secret the collector will require on inbound OTLP requests. Generate a random one (e.g. `openssl rand -hex 16`) unless the user provides one. -3. 
`HYPERDX_PASSWORD` — the password for the `hyperdx_ingest` SQL user. Generate a strong random password (e.g. `openssl rand -base64 32`) unless the user provides one. +```bash +WORKDIR="${WORKDIR:-$HOME/clickstack-otel-collector}" +mkdir -p "$WORKDIR" && chmod 700 "$WORKDIR" +ENV_FILE="$WORKDIR/collector.env" + +# Generate secrets WITHOUT printing them; write straight into a private file. +( umask 177 + { + echo "SERVICE_REF=$SERVICE_REF" + echo "OTLP_AUTH_TOKEN=$(openssl rand -hex 32)" + echo "CLICKHOUSE_USER=hyperdx_ingest" + echo "CLICKHOUSE_PASSWORD=$(openssl rand -hex 24)Aa1-" + echo "HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE=otel" + } > "$ENV_FILE" +) +chmod 600 "$ENV_FILE" +ls -l "$ENV_FILE" +``` -Echo the values back to the user with a one-line note that the OTLP token and the SQL password are sensitive. Do **not** print the password again after this confirmation. +Two things about these values matter and are easy to get wrong: ---- +- **Key names are exact.** The collector reads `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, + `CLICKHOUSE_ENDPOINT`, and `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE`. Store the SQL + password under `CLICKHOUSE_PASSWORD` (not a custom name); if it is missing, the collector + starts with an **empty** password and dies with `code: 516, Authentication failed`. +- **The password charset is constrained from three directions at once.** ClickHouse Cloud + rejects passwords without at least one uppercase character and one special character, so a + plain hex string fails at `CREATE USER`. At the same time, the collector's migration tool + embeds the password in a connection URL, so `@`, `:`, `/`, `?`, `#`, and `%` corrupt it + (symptom: `code: 516` at startup even though the password is "correct"). The recipe above + is random hex (lowercase + digits) plus the suffix `Aa1-`, which adds the required + uppercase, a digit, and a **URL-unreserved** special character (`-`). 
The OTLP token has no + such rules (it is just a bearer token), so plain hex is fine for it. -## Step 2: Install and authenticate `clickhousectl` +The env file uses **bare `KEY=VALUE` lines with no quotes**: Docker's `--env-file` does not +do shell parsing, so any quotes you add become part of the value. The charset above needs no +quoting anywhere (SQL, the env file, or `"$VAR"` in the shell). -Check `clickhousectl` is on `PATH`: +From now on, load the file when you need a value instead of typing secrets: ```bash -which clickhousectl +set -a; . "$ENV_FILE"; set +a ``` -If it is missing, install it: +**Confirm with the user** that `SERVICE_REF` is correct. Tell them the working directory and +that `collector.env` (mode `0600`) now holds the OTLP token and the SQL password. Do **not** +print either secret. If they want to see a value, point them at the file +(`grep OTLP_AUTH_TOKEN "$ENV_FILE"`). + +If the user supplied their own token or password, write those into the file instead of the +generated ones, but keep the same `0600` discipline and make sure any custom password still +meets the charset rules above. + +--- + +## Step 2: Authenticate `clickhousectl` (separate terminal by default) + +Check `clickhousectl` is on `PATH`: ```bash -curl -fsSL https://clickhouse.com/cli | sh +which clickhousectl || curl -fsSL https://clickhouse.com/cli | sh ``` -Then verify authentication: +Check authentication: ```bash clickhousectl cloud auth status ``` -The skill needs **API key authentication** (OAuth is read-only and cannot create users or run write queries). If the `API key` row is not `Active`, ask the user for credentials: +This skill needs **API key authentication**: OAuth is read-only and cannot create users or +run write queries. If the `API key` row is not `Active`, the user must authenticate. -> I need a ClickHouse Cloud API key to create the ingestion user and verify the data. 
-> -> **To get one**, open the [ClickHouse Cloud console](https://console.clickhouse.cloud), open the **Organization** menu on the left nav, choose **API keys**, then **New API key**. Select the **Admin** role — Developer-scoped keys can't auto-provision the per-service query endpoint that `cloud service query` uses. -> -> **Then, either:** +**Do not ask the user to paste their API key and secret into the chat.** Anything pasted +into the conversation lives in the transcript and has to be rotated afterward. Instead, ask +them to authenticate in a **separate terminal**, then tell you when they are done: + +> I need a ClickHouse Cloud **Admin** API key to create the ingest user and verify the data. +> Please don't paste it here. Instead: > -> - **Paste them in your next message** and I'll authenticate from this session, **or** -> - **Authenticate yourself in a separate terminal** with: +> 1. In the [Cloud console](https://console.clickhouse.cloud), open **Organization → API keys +> → New API key**, and give it the **Admin** role. (Developer-scoped keys can't provision +> the per-service Query API endpoint that `cloud service query` uses.) +> 2. In a **separate terminal**, run: > -> ```bash -> clickhousectl cloud auth login --api-key --api-secret -> ``` +> ```bash +> clickhousectl cloud auth login --api-key --api-secret +> ``` > -> …and tell me when you're done — I'll poll `clickhousectl cloud auth status` until the API key row shows `Active`. +> 3. Tell me when that's done and I'll re-check the auth status. -If the user pastes credentials, run the login command yourself (do not echo the secret back) and then verify: +Poll until the API key row reports `Active`, then confirm with a real privileged call rather +than trusting the status table alone: ```bash clickhousectl cloud auth status +clickhousectl cloud service get "$SERVICE_REF" --json | jq -r '.name, .state' +``` + +If the service resolves, you are authenticated; continue. 
+ +**Fallback if a separate-terminal login isn't picked up.** Some `clickhousectl` builds save +the credentials file but a freshly spawned shell (such as the one your tool calls run in) +doesn't read it, so `auth status` keeps showing `Not configured`. In that case, load the +saved credentials into the environment for your own calls, without printing them: + +```bash +export CLICKHOUSE_CLOUD_API_KEY="$(jq -r .api_key "$HOME/.clickhouse/credentials.json")" +export CLICKHOUSE_CLOUD_API_SECRET="$(jq -r .api_secret "$HOME/.clickhouse/credentials.json")" ``` -Do not continue until the API key row is `Active`. +Re-run the `service get` check above; the `Env vars` row should now read `Active`. This still +keeps the secret out of the chat. Do not continue until a real call succeeds. --- ## Step 3: Resolve the service and capture the HTTPS endpoint -Find the target service. If `SERVICE_REF` looks like a UUID, use it directly; otherwise, look it up by name: +Load the env file (`set -a; . "$ENV_FILE"; set +a`) and resolve the service. If `SERVICE_REF` +is a UUID, use it directly; otherwise look it up by name: ```bash # UUID form -clickhousectl cloud service get "$SERVICE_REF" --json +clickhousectl cloud service get "$SERVICE_REF" --json > "$WORKDIR/svc.json" -# Name form (search across orgs) -clickhousectl cloud service list --json | jq '.[] | select(.name=="")' +# Name form (note the double quotes: service names can contain spaces or apostrophes, +# e.g. "Alex's test") +clickhousectl cloud service list --json \ + | jq --arg n "$SERVICE_REF" '.[] | select(.name==$n)' > "$WORKDIR/svc.json" ``` -From the JSON, extract: - -- `id` → `SERVICE_ID` -- `name` → `SERVICE_NAME` -- `state` — must be `running`. If it is `stopped` or `starting`, ask the user to start the service or wait; do not proceed. -- The `endpoints` array entry with `"protocol": "https"` → `HTTPS_ENDPOINT_HOST` and `HTTPS_ENDPOINT_PORT` (typically `8443`). 
- -Note: `port` in the JSON is a number that may serialize as a float (`8443.0`). Coerce it to an integer when building the URL — for example with `jq`: +Extract the values you need, coercing the port to an integer. The port serializes as a +float (`8443.0`); if `:8443.0` leaks into the endpoint the collector's ClickHouse exporter +cannot dial it: ```bash -HTTPS_ENDPOINT=$(jq -r '.endpoints[] | select(.protocol=="https") | "https://\(.host):\(.port | tonumber | floor)"' /tmp/svc.json) +SERVICE_ID=$(jq -r '.id' "$WORKDIR/svc.json") +SERVICE_NAME=$(jq -r '.name' "$WORKDIR/svc.json") +STATE=$(jq -r '.state' "$WORKDIR/svc.json") +CLICKHOUSE_ENDPOINT=$(jq -r '.endpoints[] | select(.protocol=="https") + | "https://\(.host):\(.port | tonumber | floor)"' "$WORKDIR/svc.json") + +# Persist the resolved values back into the env file for later steps and docker --env-file. +{ echo "SERVICE_ID=$SERVICE_ID" + echo "CLICKHOUSE_ENDPOINT=$CLICKHOUSE_ENDPOINT" +} >> "$ENV_FILE" +printf 'service=%q state=%s endpoint=%s\n' "$SERVICE_NAME" "$STATE" "$CLICKHOUSE_ENDPOINT" ``` -That produces: - -``` -CLICKHOUSE_ENDPOINT=https://:8443 -``` - -Do not let `:8443.0` leak through — the OTel collector's ClickHouse exporter will fail to dial it. - -Sanity-check the service is reachable via the query API: +`STATE` must be `running`. If it is `stopped` or `starting`, ask the user to start the +service (or wait), and do not proceed. ClickHouse Cloud services **idle-suspend**, so even a +"running" service can be asleep; the next query both checks reachability and wakes it: ```bash clickhousectl cloud service query --id "$SERVICE_ID" --query "SELECT version()" ``` -A successful response confirms both the service and the per-service query endpoint key were provisioned. If this is the first time, `clickhousectl` will print `Provisioning Query API endpoint + key for service ''...` — that is expected. 
+A successful response confirms the service is awake and that the per-service Query API key is +provisioned. On the first call `clickhousectl` prints `Provisioning Query API endpoint + key +for service ''...`, which is expected. --- -## Step 4: Create the `hyperdx_ingest` SQL user +## Step 4: Create the `hyperdx_ingest` SQL user and grant it `otel.*` -Create the user and grant it the minimum privileges the ClickStack OTel collector needs to create the `otel` database and write into it: +The user name is fixed and the password charset (Step 1) needs no escaping, so single-quoting +it in SQL is safe. Load the env file first so `$CLICKHOUSE_PASSWORD` is set: ```bash +set -a; . "$ENV_FILE"; set +a + clickhousectl cloud service query --id "$SERVICE_ID" --query \ - "CREATE USER hyperdx_ingest IDENTIFIED WITH sha256_password BY '$HYPERDX_PASSWORD'" + "CREATE USER IF NOT EXISTS hyperdx_ingest IDENTIFIED WITH sha256_password BY '$CLICKHOUSE_PASSWORD'" +# If the user already existed from a prior run, force the password to this run's value. clickhousectl cloud service query --id "$SERVICE_ID" --query \ - "GRANT SELECT, INSERT, CREATE DATABASE, CREATE TABLE, CREATE VIEW ON otel.* TO hyperdx_ingest" + "ALTER USER hyperdx_ingest IDENTIFIED WITH sha256_password BY '$CLICKHOUSE_PASSWORD'" ``` -If `CREATE USER` fails with `already exists`, rotate the password instead so this run uses a known value: +Grant the least privilege the collector needs to create and write the `otel.*` schema. 
On the +current image the schema migrations and their version table also live in `otel`, so `otel.*` +is sufficient: ```bash clickhousectl cloud service query --id "$SERVICE_ID" --query \ - "ALTER USER hyperdx_ingest IDENTIFIED WITH sha256_password BY '$HYPERDX_PASSWORD'" + "GRANT SELECT, INSERT, CREATE DATABASE, CREATE TABLE, CREATE VIEW ON otel.* TO hyperdx_ingest" ``` -Verify the grants: +> **Older image builds:** some earlier collector versions ran their goose migrations against a +> version table in the `default` database, so startup looped on `ACCESS_DENIED` until +> `default.*` was also granted. If you see `ACCESS_DENIED` referencing `default` in the +> collector logs (Step 5), add this and restart the container: +> +> ```bash +> clickhousectl cloud service query --id "$SERVICE_ID" --query \ +> "GRANT SELECT, INSERT, CREATE TABLE ON default.* TO hyperdx_ingest" +> ``` + +Verify: ```bash clickhousectl cloud service query --id "$SERVICE_ID" --query "SHOW GRANTS FOR hyperdx_ingest" ``` -You should see `GRANT SELECT, INSERT, CREATE DATABASE, CREATE TABLE, CREATE VIEW ON otel.* TO hyperdx_ingest`. +You should see `GRANT SELECT, INSERT, CREATE DATABASE, CREATE TABLE, CREATE VIEW ON otel.* TO +hyperdx_ingest`. --- ## Step 5: Deploy the ClickStack OpenTelemetry collector -Run the ClickStack-distribution collector locally. It is preconfigured for Managed ClickStack — it creates the `otel.*` schema on first use and routes Session Replay events to `otel.hyperdx_sessions`. +Run the ClickStack-distribution collector locally. It creates the `otel.*` schema on first +write and routes Session Replay events to `otel.hyperdx_sessions`. Make sure Docker is running: @@ -183,89 +295,103 @@ Make sure Docker is running: docker info > /dev/null ``` -Start the collector: +Create a user-defined network. 
The collector joins it so the telemetry generator in Step 6 +can reach it by container name without any local install: + +```bash +docker network create clickstack-net 2>/dev/null || true +``` + +Start the collector, passing **all secrets via `--env-file`** (never `-e`, which would put +the secret on the command line, in shell history, and in `ps`). Expose the health port too. +The `docker rm -f` first makes the step safe to re-run: ```bash +docker rm -f clickstack-otel-collector 2>/dev/null || true docker run -d \ --name clickstack-otel-collector \ - -e OTLP_AUTH_TOKEN="$OTLP_AUTH_TOKEN" \ - -e CLICKHOUSE_ENDPOINT="$CLICKHOUSE_ENDPOINT" \ - -e CLICKHOUSE_USER=hyperdx_ingest \ - -e CLICKHOUSE_PASSWORD="$HYPERDX_PASSWORD" \ - -e HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE=otel \ + --network clickstack-net \ + --env-file "$ENV_FILE" \ -p 4317:4317 \ -p 4318:4318 \ + -p 13133:13133 \ clickhouse/clickstack-otel-collector:latest ``` -Confirm it is healthy: +The image reads `OTLP_AUTH_TOKEN`, `CLICKHOUSE_ENDPOINT`, `CLICKHOUSE_USER`, +`CLICKHOUSE_PASSWORD`, and `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE` from the env file. It +enables bearer-token auth on the OTLP receiver with an empty scheme, so callers send the raw +token as the `authorization` header (no `Bearer ` prefix). + +Confirm it is healthy. The health check needs no install: ```bash docker ps --filter name=clickstack-otel-collector --format '{{.Status}}' -docker logs --tail 30 clickstack-otel-collector 2>&1 | tail -30 +curl -fsS http://localhost:13133/ && echo +docker logs --tail 40 clickstack-otel-collector 2>&1 | tail -40 ``` -The logs should contain `Everything is ready. Begin running and processing data.` (or equivalent), and no repeated `clickhouseexporter` connection errors. If you see TLS or auth errors, the most common cause is a malformed `CLICKHOUSE_ENDPOINT` (must include `https://` and the `:8443` port). 
+A healthy start shows the seed migrations running to completion (`[seed] OK ...` lines ending +in `goose: up to current file version: N`), then `Everything is ready. Begin running and +processing data.` (or equivalent), `docker ps` reporting `Up ... (healthy)`, and the health +check returning HTTP 200. If instead the container exits, the cause is almost always in the +seed step: + +- `code: 516, Authentication failed: password is incorrect` → `CLICKHOUSE_PASSWORD` is empty + or wrong in the env file. The most common slip is storing the password under a different key + name (it **must** be `CLICKHOUSE_PASSWORD`), or using a password containing `@ : / ? # %`, + which corrupts the migration tool's connection URL. +- `[HTTP 403]` / `data size should be 0 < ` at "server hello" → same root cause: + an empty/wrong password against the HTTPS endpoint. +- TLS / dial errors → `CLICKHOUSE_ENDPOINT` is malformed (it must be `https://:8443`, + with no `.0` on the port). +- `ACCESS_DENIED` referencing `default` → only on older image builds; apply the `default.*` + grant from the Step 4 note and restart. --- ## Step 6: Send synthetic telemetry and verify ingestion -Install [`otelgen`](https://github.com/krzko/otelgen). On macOS: - -```bash -brew install krzko/tap/otelgen -``` - -Otherwise, with Go: - -```bash -go install github.com/krzko/otelgen@latest -``` - -### Time-box every otelgen call +Use `telemetrygen` (the OpenTelemetry Collector Contrib generator). Run it from its **Docker +image** on the same network, so nothing is installed on the host. Its `--duration` flag +terminates the run reliably, so no watchdog wrapper is needed. -`otelgen`'s flag handling is inconsistent — `--duration` is honored on some subcommands and silently ignored on others (notably `metrics`, but in practice not reliable on `logs multi` / `traces multi` either, depending on the build). So **do not trust `--duration`** to terminate the process. 
Instead, wrap every call in a small bash helper that backgrounds it, waits a bounded number of seconds, and then sends `SIGINT` (followed by `SIGKILL` if it ignores the interrupt): +Load the env file so the token is available as a variable, then reference `$OTLP_AUTH_TOKEN` +so the literal token never appears in the command text, your output, or shell history. +`telemetrygen`'s header syntax requires the value to be a quoted string: `key="value"`. ```bash -run_otelgen() { - # usage: run_otelgen - local secs="$1"; shift - otelgen --otel-exporter-otlp-endpoint localhost:4317 --insecure --protocol grpc \ - --header "authorization=$OTLP_AUTH_TOKEN" --rate 5 "$@" & - local pid=$! - ( sleep "$secs" && kill -INT "$pid" 2>/dev/null \ - && sleep 3 && kill -KILL "$pid" 2>/dev/null ) & - local watchdog=$! - wait "$pid" 2>/dev/null || true - kill "$watchdog" 2>/dev/null || true +set -a; . "$ENV_FILE"; set +a + +TG_IMAGE=ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest +tg() { + # usage: tg [extra telemetrygen flags...] + local signal="$1"; shift + docker run --rm --network clickstack-net "$TG_IMAGE" "$signal" \ + --otlp-endpoint clickstack-otel-collector:4317 \ + --otlp-insecure \ + --otlp-header "authorization=\"$OTLP_AUTH_TOKEN\"" \ + --rate 5 --duration 10s "$@" } -``` -Define this once, then send a short burst of each signal: - -```bash -run_otelgen 20 logs multi -run_otelgen 20 traces multi -run_otelgen 15 metrics sum +tg logs +tg traces +tg metrics --metric-type Sum ``` -Each call returns when the wall-clock budget expires — no hung processes, no stalled scripts. +(`--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`. Use `--otlp-http` +with `--otlp-endpoint clickstack-otel-collector:4318` if you want to exercise the HTTP path +instead of gRPC.) -Notes on the synthetic data: - -- `logs multi` and `traces multi` emit several events per tick; 20 seconds is plenty to populate `otel_logs` and `otel_traces`. 
-- `metrics sum` only emits one data point every few seconds — 15 seconds yields 2–3 points, which is enough to verify the metrics path. -- See the [otelgen synthetic-data guide](/use-cases/observability/clickstack/getting-started/otelgen) for the other `metrics` subcommands (`gauge`, `histogram`, `exponential-histogram`). - -Wait ~15 seconds for the collector batch flush. The ClickStack collector creates its tables on first write — confirm they exist: +Wait ~15 seconds for the collector to flush its batch, then confirm the tables exist: ```bash clickhousectl cloud service query --id "$SERVICE_ID" --query \ "SELECT name FROM system.tables WHERE database='otel' ORDER BY name" ``` -Expect at minimum `otel_logs`, `otel_traces`, and one or more `otel_metrics_*` tables. Then confirm rows are landing in each signal. The schema uses standard upstream ClickHouse-exporter column names (`Timestamp` on logs/traces, `TimeUnix` on metrics) — but rather than hard-coding column names, count by `parts.rows` on the underlying parts which is signal-agnostic: +Then confirm rows are landing. Count by `parts.rows`, which is signal-agnostic and avoids +hard-coding per-signal column names: ```bash clickhousectl cloud service query --id "$SERVICE_ID" --query \ @@ -276,56 +402,90 @@ clickhousectl cloud service query --id "$SERVICE_ID" --query \ 'otel_metrics_sum','otel_metrics_gauge', 'otel_metrics_histogram','otel_metrics_exponential_histogram', 'otel_metrics_summary') - GROUP BY table - ORDER BY table" + GROUP BY table ORDER BY table" ``` -You should see non-zero `rows` for `otel_logs`, `otel_traces`, and at least one `otel_metrics_*` table (`otel_metrics_sum` if you used the `metrics sum` command above). If any signal is missing: +You should see non-zero `rows` for `otel_logs`, `otel_traces`, and `otel_metrics_sum`. If a +signal is missing: 1. Tail the collector logs (`docker logs --tail 50 clickstack-otel-collector`) for export errors. -2. 
Confirm the `authorization` header sent by `otelgen` matches `$OTLP_AUTH_TOKEN`. -3. Re-check `CLICKHOUSE_ENDPOINT` includes the `:8443` port and `https://` scheme. -4. Some metric kinds take longer to flush — re-run the query after another 30 seconds before declaring failure. +2. Confirm the `authorization` header matches `$OTLP_AUTH_TOKEN` (a mismatch shows in the + generator output as `code = Unauthenticated desc = provided authorization does not match + expected scheme or token`). +3. Re-check `CLICKHOUSE_ENDPOINT` has the `https://` scheme and `:8443` port. +4. Some metric kinds flush slowly. Re-run the count after another 30 seconds before declaring failure. Do not proceed until every expected signal has non-zero rows. --- -## Step 7: Summarize the result and hand off +## Step 7: Own the last mile in ClickStack (wake the service, then complete onboarding) -The collector is running as a local Docker container on this machine. The ClickStack UI is reached most directly at: +Rows in ClickHouse are **not** the same as the user seeing telemetry in ClickStack. The +ClickStack UI requires a one-time onboarding step that detects the data sources, and that step +fails if the ClickHouse service has idle-suspended in the meantime. So immediately before +sending the user to the console, **wake the service**: -``` -https://hyperdx.clickhouse.cloud/search?chcServiceId= +```bash +clickhousectl cloud service query --id "$SERVICE_ID" --query "SELECT 1" ``` -(There is also a Cloud-console route — `https://console.clickhouse.cloud/services//clickstack` — which redirects to the same UI after auth. Either works; prefer the direct HyperDX URL.) +Then tell the user to finish the onboarding in the ClickHouse Cloud console. Do not just say +"done"; walk them through it: -Print a summary to the user, formatted exactly like this: +1. Go back to the [ClickHouse Cloud console](https://console.clickhouse.cloud) and open the + target service. +2. 
Select **ClickStack** from the left-hand menu. +3. Open **Getting Started** in the left-hand menu and complete the onboarding flow to detect + sources. +4. The sources are **auto-detected**, and logs and traces become available in ClickStack. + +The direct link is `https://console.clickhouse.cloud/services//clickstack`. + +**If source detection fails**, the service almost certainly idle-suspended between the data +send and the console step. Re-run the wake query above for the user, then have them re-run the +detection, rather than leaving them to debug an opaque failure. + +--- + +## Step 8: Summarize and hand off (without echoing secrets) + +Print a summary in exactly this format. Note the token is **referenced, not printed**, the +SQL password is not shown at all, and the collector keeps running until the user stops it: ``` -✅ ClickStack is set up and ingesting telemetry for service (). +✅ ClickStack is set up and ingesting telemetry for service (), + and the data sources are detected in ClickStack. Local OpenTelemetry collector - ▸ Container: clickstack-otel-collector (Docker) + ▸ Container: clickstack-otel-collector (Docker, network clickstack-net) ▸ Send OTLP gRPC to: localhost:4317 ▸ Send OTLP HTTP to: localhost:4318 - ▸ Required header: authorization: + ▸ Health check: http://localhost:13133/ + ▸ Required header: authorization: + (retrieve with: grep OTLP_AUTH_TOKEN /collector.env) ClickHouse target ▸ Endpoint: - ▸ SQL user: hyperdx_ingest + ▸ SQL user: hyperdx_ingest (password in /collector.env, mode 0600) ▸ Database: otel -Open ClickStack: - ▸ https://hyperdx.clickhouse.cloud/search?chcServiceId= +Finish in the ClickHouse Cloud console: + ▸ Open the service, select ClickStack in the left menu, then Getting Started, + and complete onboarding to auto-detect sources. + ▸ https://console.clickhouse.cloud/services//clickstack ``` Then tell the user, in your own words, that: -1. 
The collector keeps running until they stop the container — `docker stop clickstack-otel-collector` shuts it down; `docker start clickstack-otel-collector` brings it back. -2. Any application, SDK, or agent collector on this host can now send OTLP to `localhost:4317` (gRPC) or `localhost:4318` (HTTP), with the `authorization` header above. -3. Synthetic data is already flowing — they can open the HyperDX URL above to see logs / traces / metrics under the `otelgen` service. +1. All secrets live in `/collector.env` (mode `0600`). Nothing sensitive was pasted + into this chat or passed on a `docker run` command line. +2. The collector keeps running until they stop it: `docker stop clickstack-otel-collector` + shuts it down, `docker start clickstack-otel-collector` brings it back. +3. Any application, SDK, or agent on this host can now send OTLP to `localhost:4317` (gRPC) or + `localhost:4318` (HTTP) with the `authorization` header from the env file. +4. The ClickHouse Cloud service idle-suspends. If ClickStack later shows no recent data, the + service may simply be asleep; sending new telemetry or running any query wakes it. --- @@ -333,8 +493,10 @@ Then tell the user, in your own words, that: ```bash docker rm -f clickstack-otel-collector - +docker network rm clickstack-net 2>/dev/null || true clickhousectl cloud service query --id "$SERVICE_ID" --query "DROP USER IF EXISTS hyperdx_ingest" +# Optionally remove the local secrets once they are no longer needed: +# rm -f "$WORKDIR/collector.env" "$WORKDIR/svc.json" ``` -Do **not** drop the `otel` database — it contains telemetry the user may want to retain. +Do **not** drop the `otel` database: it contains telemetry the user may want to retain. 
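Reviewer sketch, not part of the patch: the Step 8 hand-off says any client on this host can send OTLP with the `authorization` header from the env file. The snippet below shows one way that might look over HTTP. It assumes the collector's OTLP HTTP receiver serves the standard `/v1/logs` path and accepts JSON; the helper names (`env_value`, `otlp_log_body`, `send_test_log`) and the `smoke-test` service name are hypothetical and do not come from the skill.

```shell
# Sketch only: helper names and the "smoke-test" service name are illustrative.

# Read one KEY=VALUE entry from a Docker-style env file (bare values, no quotes).
# The last match wins, because Step 3 appends resolved values to the same file.
env_value() {
  # usage: env_value KEY FILE
  sed -n "s/^$1=//p" "$2" | tail -n 1
}

# Build a minimal OTLP/HTTP JSON body holding a single INFO log record.
# Arguments are interpolated as-is, so they must not contain quotes or backslashes.
otlp_log_body() {
  # usage: otlp_log_body SERVICE_NAME MESSAGE UNIX_SECONDS
  printf '{"resourceLogs":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"%s"}}]},"scopeLogs":[{"logRecords":[{"timeUnixNano":"%s000000000","severityText":"INFO","body":{"stringValue":"%s"}}]}]}]}' "$1" "$3" "$2"
}

# POST the record to the local collector. The token is sent raw (no "Bearer " prefix)
# and is read from the env file, so it never appears in the command text.
send_test_log() {
  # usage: send_test_log ENV_FILE
  token=$(env_value OTLP_AUTH_TOKEN "$1")
  otlp_log_body smoke-test "hello from curl" "$(date +%s)" |
    curl -fsS -X POST http://localhost:4318/v1/logs \
      -H "Content-Type: application/json" \
      -H "authorization: $token" \
      --data-binary @-
}
```

With the collector from Step 5 running, `send_test_log "$WORKDIR/collector.env"` should post one record; a non-zero exit from `curl` would point at the token or the endpoint rather than at ClickHouse.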
From 3ac3f1732305ae3881129ccef924f3125bba0de9 Mon Sep 17 00:00:00 2001 From: Dale McDiarmid Date: Mon, 15 Jun 2026 13:30:54 +0100 Subject: [PATCH 2/4] improved skill --- .../clickstack/example-datasets/index.md | 1 + .../example-datasets/telemetrygen.md | 266 +++++++ ...setting-up-your-opentelemetry-collector.md | 114 ++- .../skills/clickstack-otel-collector/SKILL.md | 721 +++++++++++++----- 4 files changed, 867 insertions(+), 235 deletions(-) create mode 100644 docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md diff --git a/docs/use-cases/observability/clickstack/example-datasets/index.md b/docs/use-cases/observability/clickstack/example-datasets/index.md index 93b266a444f..9a957a48b29 100644 --- a/docs/use-cases/observability/clickstack/example-datasets/index.md +++ b/docs/use-cases/observability/clickstack/example-datasets/index.md @@ -18,4 +18,5 @@ This section provides various sample datasets and examples to help you get start | [Session Replay Demo](session-replay.md) | Instrument a demo web application for session replay and view your interactions in ClickStack | | [Chrome Extension](chrome-extension.md) | Inject the Browser SDK into any website using the HyperDX Chrome extension, no application code changes required | | [Synthetic data with otelgen](otelgen.md) | Use `otelgen` to send synthetic logs, traces and metrics to a running ClickStack OpenTelemetry collector | +| [Synthetic data with telemetrygen](telemetrygen.md) | Use `telemetrygen` to send diverse synthetic logs, traces and metrics, shaped with flags across services, severities, span statuses and metric types, to a running ClickStack OpenTelemetry collector | | [HackerNews Analyzer](instrument-application.md) | Instrument the HackerNews Analyzer, a Node.js application, with OpenTelemetry and send its logs, metrics, and traces to Managed ClickStack | diff --git a/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md 
b/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md new file mode 100644 index 00000000000..b55913e395e --- /dev/null +++ b/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md @@ -0,0 +1,266 @@ +--- +slug: /use-cases/observability/clickstack/getting-started/telemetrygen +title: 'Generate synthetic OpenTelemetry data with telemetrygen' +sidebar_label: 'Synthetic data with telemetrygen' +sidebar_position: 5 +pagination_prev: null +pagination_next: null +description: 'Use telemetrygen to send diverse synthetic logs, traces and metrics to a ClickStack OpenTelemetry collector' +doc_type: 'guide' +toc_max_heading_level: 2 +keywords: ['clickstack', 'telemetrygen', 'synthetic data', 'OpenTelemetry', 'test', 'logs', 'traces', 'metrics', 'observability'] +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +[`telemetrygen`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/cmd/telemetrygen) is the OpenTelemetry Collector Contrib data generator. It emits synthetic OTLP logs, traces and metrics, and exposes flags that let you shape the data: multiple services, log severities, span statuses and child spans, and different metric types. Use it to confirm that a ClickStack OpenTelemetry collector is accepting data and that varied, realistic events surface in the ClickStack UI. + +This guide assumes the collector is already running with OTLP endpoints on `4317` (gRPC) and `4318` (HTTP). + + + + + + +### Prerequisites {#prerequisites-managed} + +This guide assumes you have completed the [Getting Started Guide for Managed ClickStack](/use-cases/observability/clickstack/deployment/clickstack-clickhouse-cloud) and have an OpenTelemetry collector running with the OTLP gRPC (`4317`) and HTTP (`4318`) endpoints reachable from the machine you run `telemetrygen` on. 
If you [secured the collector](/use-cases/observability/clickstack/ingesting-data/otel-collector#securing-the-collector) with an `OTLP_AUTH_TOKEN`, keep that value handy. + +### Install telemetrygen {#install-telemetrygen-managed} + +Run `telemetrygen` from its Docker image (no install required). Define a small wrapper so the commands below stay readable; `--add-host` lets the container reach a collector listening on the host: + +```shell +telemetrygen() { + docker run --rm --add-host=host.docker.internal:host-gateway \ + ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest "$@" +} +export OTEL_ENDPOINT=host.docker.internal:4317 +``` + +Or install the binary with Go and target `localhost` instead: + +```shell +go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest +export OTEL_ENDPOINT=localhost:4317 +``` + +### Set environment variables {#set-env-vars-managed} + +Export the auth token if the collector is secured: + +```shell +export OTLP_AUTH_TOKEN= +``` + +:::note[Unsecured collector] +The ClickStack OpenTelemetry collector is unauthenticated by default. If you haven't followed [Securing the collector](/use-cases/observability/clickstack/ingesting-data/otel-collector#securing-the-collector) to set an `OTLP_AUTH_TOKEN`, drop the `--otlp-header` flag from the commands below. 
+::: + +### Generate logs {#generate-logs-managed} + +Send logs from two services with different severities, bodies, and attributes, so the `Search` view has both informational and error events to filter on: + +```shell +telemetrygen logs \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --rate 5 --duration 30s \ + --severity-text Info --severity-number 9 --body "checkout completed" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.method="POST"' + +telemetrygen logs \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service payment --rate 5 --duration 30s \ + --severity-text Error --severity-number 17 --body "payment gateway timeout" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.status_code="500"' +``` + +The most useful log flags: + +- `--service` sets `service.name` so events are attributable to a service. +- `--severity-text` and `--severity-number` set the level (`severity-number` ranges from 1 to 24). +- `--body` sets the log message. +- `--otlp-attributes` sets resource-level attributes (`key="value"`, `key=true`, or `key=`). +- `--telemetry-attributes` sets per-record attributes. + +### Generate traces {#generate-traces-managed} + +Send multi-span traces, one healthy service and one returning errors. 
The child spans and error status populate the Service Map and the error views: + +```shell +telemetrygen traces \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --rate 5 --duration 30s --workers 2 \ + --child-spans 4 --span-duration 120ms --span-links 1 --status-code Ok \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/cart"' + +telemetrygen traces \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service payment --rate 5 --duration 30s \ + --child-spans 3 --span-duration 400ms --status-code Error \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/charge"' +``` + +The most useful trace flags: + +- `--child-spans` generates that many child spans per trace, giving each trace real depth. +- `--span-duration` sets how long each span lasts (for example `120ms`, `2s`). +- `--status-code` is one of `Unset`, `Error`, `Ok` (or `0`, `1`, `2`). Use `Error` to exercise error views. +- `--span-links` adds links between spans. +- `--workers` runs several generators in parallel for a higher, more varied volume. + +### Generate metrics {#generate-metrics-managed} + +Send the three common metric types so dashboards have counters, gauges, and a distribution. 
Unlike some generators, `telemetrygen` honors `--duration` for metrics, so no manual stop is needed: + +```shell +telemetrygen metrics --metric-type Sum \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --otlp-metric-name http.server.requests \ + --aggregation-temporality cumulative --rate 5 --duration 30s + +telemetrygen metrics --metric-type Gauge \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --otlp-metric-name system.memory.usage \ + --rate 5 --duration 30s + +telemetrygen metrics --metric-type Histogram \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service payment --otlp-metric-name http.server.duration \ + --rate 5 --duration 30s +``` + +`--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`. `--otlp-metric-name` names the series so you can find it in the UI, and `--aggregation-temporality` is `delta` or `cumulative`. + +### Verify in ClickStack {#verify-managed} + +Open the ClickStack UI from the ClickHouse Cloud console. In the `Search` view, set the time range to `Last 15 minutes` and switch the source between `Logs` and `Traces`. Filter on `ServiceName` to see the `checkout` and `payment` services, and on `SeverityText` to find the `Error` log line. Open a `payment` trace to see the child spans and the error status. Open the `Chart Explorer`, select `Metrics`, and chart one of the metric names you set above (for example `http.server.requests`) to verify metrics ingestion. + + + + + + + + +### Prerequisites {#prerequisites-oss} + +This guide assumes you have started Open Source ClickStack using the [instructions for the all-in-one image](/use-cases/observability/clickstack/getting-started/oss), and that the OTLP endpoints (`4317` gRPC and `4318` HTTP) are reachable. 
You also need the ingestion API key from the HyperDX UI under `Team Settings > API Keys`. + +### Install telemetrygen {#install-telemetrygen-oss} + +Run `telemetrygen` from its Docker image (no install required). Define a small wrapper so the commands below stay readable; `--add-host` lets the container reach a collector listening on the host: + +```shell +telemetrygen() { + docker run --rm --add-host=host.docker.internal:host-gateway \ + ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest "$@" +} +export OTEL_ENDPOINT=host.docker.internal:4317 +``` + +Or install the binary with Go and target `localhost` instead: + +```shell +go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest +export OTEL_ENDPOINT=localhost:4317 +``` + +### Set environment variables {#set-env-vars-oss} + +Export the ingestion API key: + +```shell +export CLICKSTACK_API_KEY= +``` + +### Generate logs {#generate-logs-oss} + +Send logs from two services with different severities, bodies, and attributes: + +```shell +telemetrygen logs \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service checkout --rate 5 --duration 30s \ + --severity-text Info --severity-number 9 --body "checkout completed" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.method="POST"' + +telemetrygen logs \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service payment --rate 5 --duration 30s \ + --severity-text Error --severity-number 17 --body "payment gateway timeout" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.status_code="500"' +``` + +### Generate traces {#generate-traces-oss} + +Send multi-span traces, one healthy service and one returning errors: + +```shell +telemetrygen traces \ + --otlp-endpoint ${OTEL_ENDPOINT} 
--otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service checkout --rate 5 --duration 30s --workers 2 \ + --child-spans 4 --span-duration 120ms --span-links 1 --status-code Ok \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/cart"' + +telemetrygen traces \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service payment --rate 5 --duration 30s \ + --child-spans 3 --span-duration 400ms --status-code Error \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/charge"' +``` + +### Generate metrics {#generate-metrics-oss} + +Send the three common metric types: + +```shell +telemetrygen metrics --metric-type Sum \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service checkout --otlp-metric-name http.server.requests \ + --aggregation-temporality cumulative --rate 5 --duration 30s + +telemetrygen metrics --metric-type Gauge \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service checkout --otlp-metric-name system.memory.usage \ + --rate 5 --duration 30s + +telemetrygen metrics --metric-type Histogram \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \ + --service payment --otlp-metric-name http.server.duration \ + --rate 5 --duration 30s +``` + +`--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`. + +### Verify in ClickStack {#verify-oss} + +Visit [http://localhost:8080](http://localhost:8080) to open the ClickStack UI. In the `Search` view, set the time range to `Last 15 minutes` and switch the source between `Logs` and `Traces`. 
Filter on `ServiceName` to see the `checkout` and `payment` services, and on `SeverityText` to find the `Error` log line. Open a `payment` trace to see the child spans and the error status. Open the `Chart Explorer`, select `Metrics`, and chart one of the metric names you set above (for example `http.server.requests`) to verify metrics ingestion. + + + + + diff --git a/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md b/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md index 184746697fc..9549fa8110a 100644 --- a/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md +++ b/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md @@ -3,7 +3,7 @@ slug: /use-cases/observability/clickstack/setting-up-your-opentelemetry-collecto title: 'Setting up your OpenTelemetry Collector' description: 'Setting up an OpenTelemetry Collector for Managed ClickStack' doc_type: 'guide' -keywords: ['clickstack', 'opentelemetry', 'collector', 'managed', 'observability', 'gateway', 'otelgen'] +keywords: ['clickstack', 'opentelemetry', 'collector', 'managed', 'observability', 'gateway', 'telemetrygen'] unlisted: true pagination_prev: null pagination_next: null @@ -93,34 +93,60 @@ For production, we recommend enabling TLS on the OTLP endpoint. See [Securing th ## Verify the endpoint {#verify-the-endpoint} -Generate some synthetic traffic against the collector to confirm the full pipeline works. We use [`otelgen`](https://github.com/krzko/otelgen), a small CLI that emits OTLP logs, traces, and metrics. +Generate some synthetic traffic against the collector to confirm the full pipeline works. 
We use [`telemetrygen`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/cmd/telemetrygen), the OpenTelemetry Collector Contrib generator, which emits OTLP logs, traces, and metrics and exposes flags to shape the data across services, severities, span statuses, and metric types. -Install `otelgen` with Homebrew: +Run it from its Docker image (no install required). Define a small wrapper so the commands below stay readable; `--add-host` lets the container reach a collector listening on the host: ```shell -brew install krzko/tap/otelgen +telemetrygen() { + docker run --rm --add-host=host.docker.internal:host-gateway \ + ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest "$@" +} +export OTEL_ENDPOINT=host.docker.internal:4317 ``` -Or with Go: +Or install the binary with Go and target `localhost` instead: ```shell -go install github.com/krzko/otelgen@latest +go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest +export OTEL_ENDPOINT=localhost:4317 ``` -Send a short burst of logs to the collector: +Send logs tagged with a service, environment, and severity: ```shell - otelgen \ - --otel-exporter-otlp-endpoint localhost:4317 \ - --insecure \ - --protocol grpc \ - --header "authorization=${OTLP_AUTH_TOKEN}" \ - --rate 5 \ - --duration 60 \ - logs multi +telemetrygen logs \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --rate 5 --duration 30s \ + --severity-text Error --severity-number 17 --body "payment gateway timeout" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.status_code="500"' ``` -For the equivalent trace and metrics commands, and a walkthrough of the other `otelgen` subcommands, see [Synthetic data with otelgen](/use-cases/observability/clickstack/getting-started/otelgen). 
+Send multi-span traces with child spans and an error status, which populate the Service Map and error views: + +```shell +telemetrygen traces \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --rate 5 --duration 30s \ + --child-spans 4 --span-duration 120ms --span-links 1 --status-code Error \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/cart"' +``` + +Send metrics of a given type with a named series: + +```shell +telemetrygen metrics --metric-type Sum \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --otlp-metric-name http.server.requests \ + --aggregation-temporality cumulative --rate 5 --duration 30s +``` + +For the full set of flags, variations across multiple services and metric types, and verification tips, see [Synthetic data with telemetrygen](/use-cases/observability/clickstack/getting-started/telemetrygen). ## Confirm in the ClickStack UI {#confirm-in-ui} @@ -246,34 +272,60 @@ For further details on configuring OpenTelemetry collectors against Managed Clic ## Verify the endpoint {#verify-the-endpoint-existing} -Generate some synthetic traffic against your collector to confirm the full pipeline works. We use [`otelgen`](https://github.com/krzko/otelgen), a small CLI that emits OTLP logs, traces, and metrics. +Generate some synthetic traffic against your collector to confirm the full pipeline works. We use [`telemetrygen`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/cmd/telemetrygen), the OpenTelemetry Collector Contrib generator, which emits OTLP logs, traces, and metrics and exposes flags to shape the data across services, severities, span statuses, and metric types. + +Run it from its Docker image (no install required). 
Substitute `` with the host your collector listens on, and set the `authorization` header (or alternative auth method) to whatever your collector expects: + +```shell +telemetrygen() { + docker run --rm --add-host=host.docker.internal:host-gateway \ + ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest "$@" +} +export OTEL_ENDPOINT=:4317 +export OTLP_AUTH_TOKEN= +``` + +Or install the binary with Go: + +```shell +go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest +``` -Install `otelgen` with Homebrew: +Send logs tagged with a service, environment, and severity: ```shell -brew install krzko/tap/otelgen +telemetrygen logs \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --rate 5 --duration 30s \ + --severity-text Error --severity-number 17 --body "payment gateway timeout" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.status_code="500"' ``` -Or with Go: +Send multi-span traces with child spans and an error status, which populate the Service Map and error views: ```shell -go install github.com/krzko/otelgen@latest +telemetrygen traces \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --rate 5 --duration 30s \ + --child-spans 4 --span-duration 120ms --span-links 1 --status-code Error \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/cart"' ``` -Send a short burst of logs to your collector. 
Substitute `` with the host your collector listens on, and set the `authorization` header (or alternative auth method) to whatever your collector expects: +Send metrics of a given type with a named series: ```shell - otelgen \ - --otel-exporter-otlp-endpoint :4317 \ - --insecure \ - --protocol grpc \ - --header "authorization=" \ - --rate 5 \ - --duration 60 \ - logs multi +telemetrygen metrics --metric-type Sum \ + --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \ + --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \ + --service checkout --otlp-metric-name http.server.requests \ + --aggregation-temporality cumulative --rate 5 --duration 30s ``` -For the equivalent trace and metrics commands, and a walkthrough of the other `otelgen` subcommands, see [Synthetic data with otelgen](/use-cases/observability/clickstack/getting-started/otelgen). +For the full set of flags, variations across multiple services and metric types, and verification tips, see [Synthetic data with telemetrygen](/use-cases/observability/clickstack/getting-started/telemetrygen). ## Confirm in the ClickStack UI {#confirm-in-ui-existing} diff --git a/static/skills/clickstack-otel-collector/SKILL.md b/static/skills/clickstack-otel-collector/SKILL.md index 10ec87dd0df..c6fdf71a522 100644 --- a/static/skills/clickstack-otel-collector/SKILL.md +++ b/static/skills/clickstack-otel-collector/SKILL.md @@ -1,80 +1,122 @@ --- name: clickstack-otel-collector -description: Use when a user wants to stand up a local Docker OpenTelemetry collector for Managed ClickStack on a ClickHouse Cloud service, send logs/traces/metrics, and verify the data is visible in ClickStack. +description: Use when a user wants to wire an OpenTelemetry collector into a Managed ClickStack service on ClickHouse Cloud, either by deploying a new local collector (Docker run or Docker Compose) or by configuring their own existing collector, then send rich synthetic telemetry and verify it is visible in ClickStack. 
license: Apache-2.0 metadata: author: ClickHouse Inc - version: "0.2.0" + version: "0.5.0" --- -# Set up the ClickStack OpenTelemetry collector (local Docker) +# Set up an OpenTelemetry collector for Managed ClickStack -This skill wires a **local Docker** OpenTelemetry collector into a Managed ClickStack -service running on ClickHouse Cloud, sends synthetic telemetry through it, and confirms -the data is actually visible in ClickStack. It uses -[`clickhousectl`](https://clickhouse.com/docs/interfaces/cli) for all cloud and SQL -operations. +This skill wires an OpenTelemetry collector into a Managed ClickStack service running on +ClickHouse Cloud, sends rich synthetic telemetry through it, and confirms the data is actually +visible in ClickStack. It uses [`clickhousectl`](https://clickhouse.com/docs/interfaces/cli) +for all cloud and SQL operations. -**Scope.** This is deliberately the *local Docker collector* path. It is one of several -things a user might want: they may already run a collector and only need exporter config, -or they may want a Kubernetes deployment with secrets in K8s Secrets. Those are out of -scope here. If the user clearly wants one of those instead, say so and stop, rather than -building a local container they did not ask for. +**Scope.** This skill supports two paths, chosen in Step 0: + +1. **Deploy a new collector** locally. You can do this **two ways: individual `docker` commands, + or a `docker compose` file** (recommended, fewer commands and one file to start/stop). Make the + user aware of both up front and let them pick in Step 0; do not assume plain `docker`. Either + way runs the ClickStack distribution of the collector, preconfigured for Managed ClickStack. +2. **Configure your own existing collector** by adding the ClickHouse exporter configuration. + We give you the exact config to drop in; you reload your collector. Use this if you already + run a collector in a **gateway** role. 
+ +A full Kubernetes deployment (Helm, secrets in K8s Secrets) is out of scope here; the config +we generate in path 2 can be applied to a collector running anywhere. The end state is: -- A dedicated `hyperdx_ingest` SQL user on the target service, with exactly the grants - this collector image needs (including the `default.*` grant its migrations require). -- A ClickStack-distribution OpenTelemetry collector running locally as a Docker container, - accepting OTLP on `4317`/`4318`, exposing a health endpoint on `13133`, and writing into - the `otel` database on the target service. -- Synthetic telemetry exercising the logs, traces, and metrics pipelines. -- The service confirmed **awake**, and the user pointed to the ClickHouse Cloud console to - complete ClickStack onboarding (Getting Started, auto-detect sources) so they can actually - *see* their data. +- A dedicated `hyperdx_ingest` SQL user on the target service, with exactly the grants the + collector needs (it creates the `otel.*` schema on first write). +- A collector forwarding logs, traces, and metrics into the `otel` database on the service, + either the new local ClickStack collector or your existing one. +- Rich synthetic telemetry across several services, severities, span statuses, and metric + types, so ClickStack's Search, Service Map, and dashboards have something real to show. +- The service confirmed **awake**, and the user walked through the ClickStack onboarding in the + Cloud console so they can actually *see* their data. -Secrets (the OTLP auth token and the SQL password) are generated locally, written **once** -to a `0600` env file, and passed to Docker via `--env-file`. They are never pasted into the -chat, never passed with `docker run -e`, and never echoed back after creation. +Secrets (the OTLP auth token and the SQL password) are generated locally, written **once** to a +`0600` env file, and passed to Docker via `--env-file`. 
They are never pasted into the chat, +never passed with `docker run -e`, and never echoed back after creation. Follow these steps in order. Each step depends on state established by the previous one. --- -## Step 0: Batch the permissions up front +## Step 0: Choose your path + +Ask the user two short questions before doing anything else, because they determine which later +steps run. + +**Question 1: Do you already have an OpenTelemetry collector running in a gateway role?** + +- **No, set one up for me.** -> the **new-collector** path. Continue to Question 2. +- **Yes, I have one.** -> the **existing-collector** path. Skip Question 2 (it does not apply), + and in Step 6 you will configure their collector rather than deploy a new one. + +**Question 2 (new-collector path only): Run the collector with individual Docker commands, or a +Docker Compose file?** + +- **Docker Compose (recommended).** Fewer commands, one file to start and stop, easiest to + re-run. Best if `docker compose` is available. +- **Individual Docker commands.** Use if Compose is not installed or you prefer explicit + commands. + +Record the answers as `COLLECTOR_PATH` (`new` or `existing`) and, for the new path, +`DEPLOY_MODE` (`compose` or `run`). Refer back to them in Step 6 and Step 7. + +--- + +## Step 1: Batch the permissions up front Coding agents prompt for approval the first time they see each shell command. To avoid -interrupting the user every few steps, ask them once, up front, to allowlist these command -prefixes (the "always allow for this project / session" option in their agent). 
There are no -destructive operations and nothing targets anything outside this project or their ClickHouse -Cloud service: - -| Command prefix | Used for | -| --- | --- | -| `openssl rand …` | generate the OTLP token and SQL password | -| `clickhousectl cloud …` | auth status, resolve the service, run SQL via the Query API | -| `docker …` | run/inspect the collector and the telemetry generator | -| `curl …` | one local health check against `localhost:13133` | -| `jq …` | parse JSON from `clickhousectl` | - -Tell the user, in your own words: *"If your agent supports it, choose 'always allow' for -each of these the first time it asks. The whole run is read-only against your machine except -for one Docker container, and write operations against ClickHouse are limited to creating the -ingest user and the `otel` schema."* +interrupting the user every few steps, ask them once, up front, to allowlist the command +prefixes below (the "always allow for this project / session" option in their agent). There are +no destructive operations and nothing targets anything outside this project or their ClickHouse +Cloud service. + +| Command prefix | Used for | Needed when | +| --- | --- | --- | +| `openssl rand …` | generate the OTLP token and SQL password | always | +| `clickhousectl cloud …` | auth, resolve the service, run SQL via the Query API | always | +| `jq …` | parse JSON from `clickhousectl` | always | +| `docker …` / `docker compose …` | run/inspect the collector and the telemetry generator | new-collector path, and the optional telemetry check | +| `curl …` | local health check against `localhost:13133` (and installing `clickhousectl` if missing) | new-collector path | + +Tell the user, in your own words: *"If your agent supports it, choose 'always allow' for each +of these the first time it asks. 
The whole run is read-only against your machine except for the +collector container, and write operations against ClickHouse are limited to creating the ingest +user and the `otel` schema."* + +If the user is on the existing-collector path and does not want to run the optional telemetry +check, you can drop `docker` and `curl` from the list. + +**Two approvals are semantic, not prefix-based, so allowlisting won't pre-clear them.** Warn the +user to expect these and approve them explicitly when they appear: + +- The `clickhousectl` install in Step 3 uses `curl … | sh`, which many agent sandboxes flag as + "downloading and running untrusted code" regardless of any `curl` allowlist rule. +- The `CREATE USER` / `GRANT` in Step 5 may be flagged as "modifying shared production + infrastructure," again independent of the `clickhousectl` prefix rule. + +Neither is solved by the table above; they are one-time, intentional, and safe to approve. Then continue. --- -## Step 1: Confirm the target service and lay down the secrets file +## Step 2: Confirm the target service and lay down the secrets file -The user's prompt contains a service identifier, either a service ID (UUID) or a service -name. Treat that value as `SERVICE_REF`. +The user's prompt contains a service identifier, either a service ID (UUID) or a service name. +Treat that value as `SERVICE_REF`. Create a working directory and a **`0600` env file** that will hold all configuration and -secrets for this run. The key names match exactly what the collector image reads, so this -same file is passed straight to `docker run --env-file` in Step 5. Write it under a tight -`umask` so the secret is never briefly world-readable: +secrets for this run. The key names match exactly what the collector image reads, so this same +file is passed straight to `docker run --env-file` (or referenced by Compose) in Step 6. 
Write +it under a tight `umask` so the secret is never briefly world-readable: ```bash WORKDIR="${WORKDIR:-$HOME/clickstack-otel-collector}" @@ -98,21 +140,25 @@ ls -l "$ENV_FILE" Two things about these values matter and are easy to get wrong: - **Key names are exact.** The collector reads `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, - `CLICKHOUSE_ENDPOINT`, and `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE`. Store the SQL - password under `CLICKHOUSE_PASSWORD` (not a custom name); if it is missing, the collector - starts with an **empty** password and dies with `code: 516, Authentication failed`. -- **The password charset is constrained from three directions at once.** ClickHouse Cloud - rejects passwords without at least one uppercase character and one special character, so a - plain hex string fails at `CREATE USER`. At the same time, the collector's migration tool - embeds the password in a connection URL, so `@`, `:`, `/`, `?`, `#`, and `%` corrupt it - (symptom: `code: 516` at startup even though the password is "correct"). The recipe above - is random hex (lowercase + digits) plus the suffix `Aa1-`, which adds the required - uppercase, a digit, and a **URL-unreserved** special character (`-`). The OTLP token has no - such rules (it is just a bearer token), so plain hex is fine for it. - -The env file uses **bare `KEY=VALUE` lines with no quotes**: Docker's `--env-file` does not -do shell parsing, so any quotes you add become part of the value. The charset above needs no -quoting anywhere (SQL, the env file, or `"$VAR"` in the shell). + `CLICKHOUSE_ENDPOINT`, and `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE`. Store the SQL password + under `CLICKHOUSE_PASSWORD` (not a custom name); if it is missing, the collector starts with an + **empty** password and dies with `code: 516, Authentication failed`. 
+- **The password charset is constrained from three directions at once.** ClickHouse Cloud rejects + passwords without at least one uppercase character and one special character, so a plain hex + string fails at `CREATE USER`. At the same time, the collector's migration tool embeds the + password in a connection URL, so `@`, `:`, `/`, `?`, `#`, and `%` corrupt it (symptom: + `code: 516` at startup even though the password is "correct"). The recipe above is random hex + (lowercase + digits) plus the suffix `Aa1-`, which adds the required uppercase, a digit, and a + **URL-unreserved** special character (`-`). The OTLP token has no such rules (it is just a + bearer token), so plain hex is fine for it. + +The env file uses **bare `KEY=VALUE` lines with no quotes**: Docker's `--env-file` does not do +shell parsing, so any quotes you add become part of the value. + +On the **existing-collector path** the `OTLP_AUTH_TOKEN` is not used by your collector (auth on +your receiver is your own setup); it is generated only so the same file works if you later switch +to the local collector. The `CLICKHOUSE_*` values are still used: they go into the exporter +config you add to your collector in Step 6. From now on, load the file when you need a value instead of typing secrets: @@ -120,23 +166,31 @@ From now on, load the file when you need a value instead of typing secrets: set -a; . "$ENV_FILE"; set +a ``` -**Confirm with the user** that `SERVICE_REF` is correct. Tell them the working directory and -that `collector.env` (mode `0600`) now holds the OTLP token and the SQL password. Do **not** -print either secret. If they want to see a value, point them at the file +**Confirm with the user** that `SERVICE_REF` is correct. Tell them the working directory and that +`collector.env` (mode `0600`) now holds the OTLP token and the SQL password. Do **not** print +either secret. If they want to see a value, point them at the file (`grep OTLP_AUTH_TOKEN "$ENV_FILE"`). 
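
The charset rules above can also be checked mechanically, which is useful when a password does not come from the generated recipe. The sketch below is a hypothetical helper (`valid_ingest_password` is not part of the skill or of `clickhousectl`). It assumes "special character" means any non-alphanumeric character, and it covers only the rules stated above; ClickHouse Cloud may enforce others, such as a minimum length.

```bash
# Hypothetical helper: returns 0 if the password satisfies the charset rules
# described above, non-zero otherwise. It never prints the password.
valid_ingest_password() {
  local pw="$1"
  case "$pw" in
    *[[:upper:]]*) ;;          # at least one uppercase character
    *) return 1 ;;
  esac
  case "$pw" in
    *[@:/?#%]*) return 1 ;;    # these corrupt the migration tool's connection URL
  esac
  case "$pw" in
    *[![:alnum:]]*) ;;         # assumption: "special" means non-alphanumeric
    *) return 1 ;;
  esac
}
```

After loading the env file, `valid_ingest_password "$CLICKHOUSE_PASSWORD" || echo "password breaks the charset rules"` reports a problem without echoing the secret.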
If the user supplied their own token or password, write those into the file instead of the -generated ones, but keep the same `0600` discipline and make sure any custom password still -meets the charset rules above. +generated ones, but keep the same `0600` discipline and make sure any custom password still meets +the charset rules above. --- -## Step 2: Authenticate `clickhousectl` (separate terminal by default) +## Step 3: Authenticate `clickhousectl` (separate terminal by default) -Check `clickhousectl` is on `PATH`: +Check `clickhousectl` is on `PATH`. Run this presence check **on its own**, not chained to the +installer: the `|| curl … | sh` form drags a harmless check into a compound command that sandboxes +deny wholesale as an untrusted-code download. ```bash -which clickhousectl || curl -fsSL https://clickhouse.com/cli | sh +which clickhousectl +``` + +Only if that prints nothing, install it (the user may need to approve this explicitly, see Step 1): + +```bash +curl -fsSL https://clickhouse.com/cli | sh ``` Check authentication: @@ -145,19 +199,19 @@ Check authentication: clickhousectl cloud auth status ``` -This skill needs **API key authentication**: OAuth is read-only and cannot create users or -run write queries. If the `API key` row is not `Active`, the user must authenticate. +This skill needs **API key authentication**: OAuth is read-only and cannot create users or run +write queries. If the `API key` row is not `Active`, the user must authenticate. -**Do not ask the user to paste their API key and secret into the chat.** Anything pasted -into the conversation lives in the transcript and has to be rotated afterward. Instead, ask -them to authenticate in a **separate terminal**, then tell you when they are done: +**Do not ask the user to paste their API key and secret into the chat.** Anything pasted into the +conversation lives in the transcript and has to be rotated afterward. 
Instead, ask them to +authenticate in a **separate terminal**, then tell you when they are done: > I need a ClickHouse Cloud **Admin** API key to create the ingest user and verify the data. > Please don't paste it here. Instead: > > 1. In the [Cloud console](https://console.clickhouse.cloud), open **Organization → API keys -> → New API key**, and give it the **Admin** role. (Developer-scoped keys can't provision -> the per-service Query API endpoint that `cloud service query` uses.) +> → New API key**, and give it the **Admin** role. (Developer-scoped keys can't provision the +> per-service Query API endpoint that `cloud service query` uses.) > 2. In a **separate terminal**, run: > > ```bash @@ -166,35 +220,54 @@ them to authenticate in a **separate terminal**, then tell you when they are don > > 3. Tell me when that's done and I'll re-check the auth status. -Poll until the API key row reports `Active`, then confirm with a real privileged call rather -than trusting the status table alone: +Poll until the API key row reports `Active`, then confirm with a real privileged call rather than +trusting the status table alone. Use a **ref-agnostic** call here: `SERVICE_REF` may be a name, and +`cloud service get` only accepts a UUID, so confirming with `get` would fail on a name for reasons +unrelated to auth. `cloud service list` needs no ref and proves the API key works: ```bash clickhousectl cloud auth status -clickhousectl cloud service get "$SERVICE_REF" --json | jq -r '.name, .state' +clickhousectl cloud service list --json | jq -r '.[].name' ``` -If the service resolves, you are authenticated; continue. +If the list returns your services, you are authenticated; continue. The actual name-or-UUID +resolution of `SERVICE_REF` happens in Step 4. 
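
Because `cloud service get` only accepts a UUID while `SERVICE_REF` may be a name, it can help to classify the value up front. This is a minimal sketch, assuming service IDs use the canonical 8-4-4-4-12 hexadecimal UUID layout; `service_ref_kind` is a hypothetical helper name, not a `clickhousectl` feature.

```bash
# Hypothetical helper: prints "uuid" if the argument looks like a canonical
# UUID, otherwise "name". It only inspects the string; it makes no API call.
service_ref_kind() {
  if printf '%s\n' "$1" \
      | grep -Eqi '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'; then
    echo uuid
  else
    echo name
  fi
}
```

Calling `service_ref_kind "$SERVICE_REF"` then tells you which lookup form applies when the service is resolved.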
-**Fallback if a separate-terminal login isn't picked up.** Some `clickhousectl` builds save -the credentials file but a freshly spawned shell (such as the one your tool calls run in) -doesn't read it, so `auth status` keeps showing `Not configured`. In that case, load the -saved credentials into the environment for your own calls, without printing them: +**Expect to need the env-var credentials (common, not an edge case).** Many `clickhousectl` builds +save the credentials file but a freshly spawned shell (such as the one your tool calls run in) +doesn't read it, so `auth status` shows `Active` yet the very next `clickhousectl` call reports +`No credentials found`. Rather than treat this as a rare fallback, write a small **sourceable +creds file** once, then load it in every later shell. This keeps each subsequent shell to a single +`.` line instead of two `jq` re-derivations, and keeps the secret out of the chat: ```bash -export CLICKHOUSE_CLOUD_API_KEY="$(jq -r .api_key "$HOME/.clickhouse/credentials.json")" -export CLICKHOUSE_CLOUD_API_SECRET="$(jq -r .api_secret "$HOME/.clickhouse/credentials.json")" +# Write a private, sourceable creds file next to collector.env. +( umask 177 + { echo "export CLICKHOUSE_CLOUD_API_KEY=$(jq -r .api_key "$HOME/.clickhouse/credentials.json")" + echo "export CLICKHOUSE_CLOUD_API_SECRET=$(jq -r .api_secret "$HOME/.clickhouse/credentials.json")" + } > "$WORKDIR/creds.env" +) +chmod 600 "$WORKDIR/creds.env" ``` -Re-run the `service get` check above; the `Env vars` row should now read `Active`. This still -keeps the secret out of the chat. Do not continue until a real call succeeds. +**From now on, open every shell that calls `clickhousectl` with both loads**, because env vars do +not persist across shells: + +```bash +. "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a +``` + +Re-run the `service list` check above with the creds loaded; it should now succeed. Do not continue +until a real call works. 
(If `clickhousectl auth status` already shows `API key … Active` and calls +succeed without `creds.env`, you can skip this; but most agent shells need it.) --- -## Step 3: Resolve the service and capture the HTTPS endpoint +## Step 4: Resolve the service and capture the HTTPS endpoint -Load the env file (`set -a; . "$ENV_FILE"; set +a`) and resolve the service. If `SERVICE_REF` -is a UUID, use it directly; otherwise look it up by name: +Open the shell with the combined load so credentials and config are both present +(`[ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a`), then +resolve the service. If `SERVICE_REF` is a UUID, use it directly; otherwise look it up by name: ```bash # UUID form @@ -206,9 +279,9 @@ clickhousectl cloud service list --json \ | jq --arg n "$SERVICE_REF" '.[] | select(.name==$n)' > "$WORKDIR/svc.json" ``` -Extract the values you need, coercing the port to an integer. The port serializes as a -float (`8443.0`); if `:8443.0` leaks into the endpoint the collector's ClickHouse exporter -cannot dial it: +Extract the values you need, coercing the port to an integer. The port serializes as a float +(`8443.0`); if `:8443.0` leaks into the endpoint the collector's ClickHouse exporter cannot dial +it: ```bash SERVICE_ID=$(jq -r '.id' "$WORKDIR/svc.json") @@ -218,45 +291,60 @@ CLICKHOUSE_ENDPOINT=$(jq -r '.endpoints[] | select(.protocol=="https") | "https://\(.host):\(.port | tonumber | floor)"' "$WORKDIR/svc.json") # Persist the resolved values back into the env file for later steps and docker --env-file. -{ echo "SERVICE_ID=$SERVICE_ID" - echo "CLICKHOUSE_ENDPOINT=$CLICKHOUSE_ENDPOINT" -} >> "$ENV_FILE" +# Append only if the key is not already present, so a second run does not duplicate lines. 
+grep -q '^SERVICE_ID=' "$ENV_FILE" || echo "SERVICE_ID=$SERVICE_ID" >> "$ENV_FILE" +grep -q '^CLICKHOUSE_ENDPOINT=' "$ENV_FILE" || echo "CLICKHOUSE_ENDPOINT=$CLICKHOUSE_ENDPOINT" >> "$ENV_FILE" printf 'service=%q state=%s endpoint=%s\n' "$SERVICE_NAME" "$STATE" "$CLICKHOUSE_ENDPOINT" ``` -`STATE` must be `running`. If it is `stopped` or `starting`, ask the user to start the -service (or wait), and do not proceed. ClickHouse Cloud services **idle-suspend**, so even a -"running" service can be asleep; the next query both checks reachability and wakes it: +`STATE` must be `running`. If it is `stopped` or `starting`, ask the user to start the service (or +wait), and do not proceed. ClickHouse Cloud services **idle-suspend**, so even a "running" service +can be asleep; the next query both checks reachability and wakes it: ```bash clickhousectl cloud service query --id "$SERVICE_ID" --query "SELECT version()" ``` A successful response confirms the service is awake and that the per-service Query API key is -provisioned. On the first call `clickhousectl` prints `Provisioning Query API endpoint + key -for service ''...`, which is expected. +provisioned. On the first call `clickhousectl` prints `Provisioning Query API endpoint + key for +service ''...`, which is expected. --- -## Step 4: Create the `hyperdx_ingest` SQL user and grant it `otel.*` +## Step 5: Create the `hyperdx_ingest` SQL user and grant it `otel.*` -The user name is fixed and the password charset (Step 1) needs no escaping, so single-quoting -it in SQL is safe. Load the env file first so `$CLICKHOUSE_PASSWORD` is set: +This step is the same on both paths: the collector (new or existing) authenticates to ClickHouse +as `hyperdx_ingest`. The user name is fixed and the password charset (Step 2) needs no escaping, so +single-quoting it in SQL is safe. Open the shell with the combined load so `$CLICKHOUSE_PASSWORD` +(and credentials) are set. -```bash -set -a; . 
"$ENV_FILE"; set +a +> **Expect an approval prompt here.** The `CREATE USER` / `GRANT` statements below are DDL against +> a Cloud service, so some agent sandboxes flag them as "modifying shared production +> infrastructure" even when `clickhousectl` is allowlisted. This is expected; the operations are +> scoped to a single dedicated ingest user and the `otel` schema, and the user should approve them +> explicitly when prompted. -clickhousectl cloud service query --id "$SERVICE_ID" --query \ - "CREATE USER IF NOT EXISTS hyperdx_ingest IDENTIFIED WITH sha256_password BY '$CLICKHOUSE_PASSWORD'" +**Pass the password-bearing statements over stdin, never as `--query`.** Interpolating +`$CLICKHOUSE_PASSWORD` into `--query "… BY '…'"` puts the secret in the process arg list (visible in +`ps`) and shell history, the very thing the skill avoids by using `docker --env-file` over `-e`. It +also trips auto-mode classifiers, which deny the call citing "the secret expanded on the command +line." Feed the DDL through `--queries-file -` (stdin) with a heredoc instead, so the password +reaches ClickHouse but never an argument: -# If the user already existed from a prior run, force the password to this run's value. -clickhousectl cloud service query --id "$SERVICE_ID" --query \ - "ALTER USER hyperdx_ingest IDENTIFIED WITH sha256_password BY '$CLICKHOUSE_PASSWORD'" +```bash +[ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a + +# CREATE then ALTER so a re-run forces the password to this run's value. The heredoc body is +# stdin, so $CLICKHOUSE_PASSWORD is never on the command line or in `ps`. +clickhousectl cloud service query --id "$SERVICE_ID" --queries-file - < **Older image builds:** some earlier collector versions ran their goose migrations against a -> version table in the `default` database, so startup looped on `ACCESS_DENIED` until -> `default.*` was also granted. 
If you see `ACCESS_DENIED` referencing `default` in the -> collector logs (Step 5), add this and restart the container: +> version table in the `default` database, so startup looped on `ACCESS_DENIED` until `default.*` +> was also granted. If you see `ACCESS_DENIED` referencing `default` in the collector logs +> (Step 6), add this and restart the container: > > ```bash > clickhousectl cloud service query --id "$SERVICE_ID" --query \ @@ -284,27 +372,69 @@ hyperdx_ingest`. --- -## Step 5: Deploy the ClickStack OpenTelemetry collector +## Step 6: Set up the collector -Run the ClickStack-distribution collector locally. It creates the `otel.*` schema on first -write and routes Session Replay events to `otel.hyperdx_sessions`. +Follow the sub-section that matches the path and mode you chose in Step 0. All three converge on +the same end state: a collector accepting OTLP and writing into the `otel` database on the service. -Make sure Docker is running: +Make sure Docker is running (new-collector path only): ```bash docker info > /dev/null ``` -Create a user-defined network. The collector joins it so the telemetry generator in Step 6 -can reach it by container name without any local install: +### Step 6a: New collector with Docker Compose (`DEPLOY_MODE=compose`) + +Write a Compose file in the working directory. 
It reads the same `collector.env` for secrets, +publishes the OTLP and health ports, and pins a named network so the telemetry generator in Step 7 +can reach the collector by container name: + +```bash +cat > "$WORKDIR/docker-compose.yaml" <<'EOF' +name: clickstack +services: + otel-collector: + image: clickhouse/clickstack-otel-collector:latest + container_name: clickstack-otel-collector + env_file: ./collector.env + ports: + - "4317:4317" # OTLP gRPC + - "4318:4318" # OTLP HTTP + - "13133:13133" # health + restart: unless-stopped + networks: [clickstack-net] +networks: + clickstack-net: + name: clickstack-net +EOF + +# Compose refuses to adopt a clickstack-net it did not create (a leftover from the docker run +# path, a prior failed Compose run, or a DEPLOY_MODE switch), failing with "network clickstack-net +# was found but has incorrect label". If an orphan exists with no containers attached, remove it so +# Compose can recreate it with its own labels. +if docker network inspect clickstack-net >/dev/null 2>&1 \ + && [ -z "$(docker network inspect clickstack-net -f '{{range .Containers}}{{.Name}} {{end}}')" ]; then + docker network rm clickstack-net +fi + +( cd "$WORKDIR" && docker compose up -d ) +``` + +Compose creates the `clickstack-net` network for you (the guard above clears an orphaned one from a +prior run first). Skip to **Step 6d** to confirm health. + +### Step 6b: New collector with individual Docker commands (`DEPLOY_MODE=run`) + +Create a user-defined network so the telemetry generator in Step 7 can reach the collector by +container name: ```bash docker network create clickstack-net 2>/dev/null || true ``` -Start the collector, passing **all secrets via `--env-file`** (never `-e`, which would put -the secret on the command line, in shell history, and in `ps`). Expose the health port too. 
-The `docker rm -f` first makes the step safe to re-run: +Start the collector, passing **all secrets via `--env-file`** (never `-e`, which would put the +secret on the command line, in shell history, and in `ps`). The `docker rm -f` first makes the step +safe to re-run: ```bash docker rm -f clickstack-otel-collector 2>/dev/null || true @@ -318,12 +448,119 @@ docker run -d \ clickhouse/clickstack-otel-collector:latest ``` -The image reads `OTLP_AUTH_TOKEN`, `CLICKHOUSE_ENDPOINT`, `CLICKHOUSE_USER`, -`CLICKHOUSE_PASSWORD`, and `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE` from the env file. It -enables bearer-token auth on the OTLP receiver with an empty scheme, so callers send the raw -token as the `authorization` header (no `Bearer ` prefix). +The image reads `OTLP_AUTH_TOKEN`, `CLICKHOUSE_ENDPOINT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, +and `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE` from the env file. It enables bearer-token auth on +the OTLP receiver with an empty scheme, so callers send the raw token as the `authorization` header +(no `Bearer ` prefix). Continue to **Step 6d**. -Confirm it is healthy. The health check needs no install: +### Step 6c: Configure your existing collector (`COLLECTOR_PATH=existing`) + +Add the ClickHouse exporter to your existing collector configuration. The config below matches the +behavior of the ClickStack distribution, including the Session Replay (`rrweb`) routing path, and +writes into the `otel` database the ClickStack UI expects. 
+
+Print the two values you need to substitute (do not paste the password into chat; read it from the
+file on your own machine):
+
+```bash
+grep -E '^CLICKHOUSE_ENDPOINT=' "$ENV_FILE"
+# password lives here too: grep CLICKHOUSE_PASSWORD "$ENV_FILE"
+```
+
+Substitute `<CLICKHOUSE_ENDPOINT>` and `<CLICKHOUSE_PASSWORD>` with those values, add this to
+your collector config, and reload it:
+
+```yaml
+receivers:
+  otlp/hyperdx:
+    protocols:
+      grpc:
+        include_metadata: true
+        endpoint: "0.0.0.0:4317"
+      http:
+        cors:
+          allowed_origins: ["*"]
+          allowed_headers: ["*"]
+        include_metadata: true
+        endpoint: "0.0.0.0:4318"
+
+processors:
+  batch:
+  memory_limiter:
+    limit_mib: 1500
+    spike_limit_mib: 512
+    check_interval: 5s
+
+connectors:
+  routing/logs:
+    default_pipelines: [logs/out-default]
+    error_mode: ignore
+    table:
+      - context: log
+        statement: route() where IsMatch(attributes["rr-web.event"], ".*")
+        pipelines: [logs/out-rrweb]
+
+exporters:
+  clickhouse:
+    database: otel
+    endpoint: <CLICKHOUSE_ENDPOINT>
+    username: hyperdx_ingest
+    password: <CLICKHOUSE_PASSWORD>
+    ttl: 720h
+    timeout: 5s
+    retry_on_failure:
+      enabled: true
+      initial_interval: 5s
+      max_interval: 30s
+      max_elapsed_time: 300s
+  clickhouse/rrweb:
+    database: otel
+    endpoint: <CLICKHOUSE_ENDPOINT>
+    username: hyperdx_ingest
+    password: <CLICKHOUSE_PASSWORD>
+    ttl: 720h
+    logs_table_name: hyperdx_sessions
+    timeout: 5s
+    retry_on_failure:
+      enabled: true
+      initial_interval: 5s
+      max_interval: 30s
+      max_elapsed_time: 300s
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp/hyperdx]
+      processors: [memory_limiter, batch]
+      exporters: [clickhouse]
+    metrics:
+      receivers: [otlp/hyperdx]
+      processors: [memory_limiter, batch]
+      exporters: [clickhouse]
+    logs/in:
+      receivers: [otlp/hyperdx]
+      exporters: [routing/logs]
+    logs/out-default:
+      receivers: [routing/logs]
+      processors: [memory_limiter, batch]
+      exporters: [clickhouse]
+    logs/out-rrweb:
+      receivers: [routing/logs]
+      processors: [memory_limiter, batch]
+      exporters: [clickhouse/rrweb]
+```
+
+Notes for this path:
+
+- If you use your own distribution, ensure it includes the
ClickHouse exporter. The upstream + [contrib image](https://github.com/open-telemetry/opentelemetry-collector-contrib) already does. +- Authentication on the OTLP receivers is your existing setup. The `OTLP_AUTH_TOKEN` generated in + Step 2 is not used here unless you wire it into your own auth (for example `bearertokenauth`). +- After reloading, skip the health check below (that is specific to the local container) and go + straight to **Step 7** to send a verification burst (point the generator at your own collector's + OTLP endpoint). + +### Step 6d: Confirm the local collector is healthy (new-collector path) ```bash docker ps --filter name=clickstack-otel-collector --format '{{.Status}}' @@ -331,57 +568,111 @@ curl -fsS http://localhost:13133/ && echo docker logs --tail 40 clickstack-otel-collector 2>&1 | tail -40 ``` -A healthy start shows the seed migrations running to completion (`[seed] OK ...` lines ending -in `goose: up to current file version: N`), then `Everything is ready. Begin running and -processing data.` (or equivalent), `docker ps` reporting `Up ... (healthy)`, and the health -check returning HTTP 200. If instead the container exits, the cause is almost always in the -seed step: - -- `code: 516, Authentication failed: password is incorrect` → `CLICKHOUSE_PASSWORD` is empty - or wrong in the env file. The most common slip is storing the password under a different key - name (it **must** be `CLICKHOUSE_PASSWORD`), or using a password containing `@ : / ? # %`, - which corrupts the migration tool's connection URL. -- `[HTTP 403]` / `data size should be 0 < ` at "server hello" → same root cause: - an empty/wrong password against the HTTPS endpoint. -- TLS / dial errors → `CLICKHOUSE_ENDPOINT` is malformed (it must be `https://:8443`, - with no `.0` on the port). -- `ACCESS_DENIED` referencing `default` → only on older image builds; apply the `default.*` - grant from the Step 4 note and restart. 
+A healthy start shows the seed migrations running to completion (`[seed] OK ...` lines ending in
+`goose: up to current file version: N`), then `Everything is ready. Begin running and processing
+data.` (or equivalent), `docker ps` reporting `Up ... (healthy)`, and the health check returning
+HTTP 200. A seed line like `ClickHouse 25.12 < 26.2, falling back to compatibility logs and traces
+schemas` on an older server version is **expected and harmless**, not an error; do not pause on it.
+If instead the container exits, the cause is almost always in the seed step:
+
+- `code: 516, Authentication failed: password is incorrect` -> `CLICKHOUSE_PASSWORD` is empty or
+  wrong in the env file. The most common slip is storing the password under a different key name
+  (it **must** be `CLICKHOUSE_PASSWORD`), or using a password containing `@ : / ? # %`, which
+  corrupts the migration tool's connection URL.
+- `[HTTP 403]` / `data size should be 0 < ` at "server hello" -> same root cause: an
+  empty/wrong password against the HTTPS endpoint.
+- TLS / dial errors -> `CLICKHOUSE_ENDPOINT` is malformed (it must be `https://<host>:8443`, with
+  no `.0` on the port).
+- `ACCESS_DENIED` referencing `default` -> only on older image builds; apply the `default.*` grant
+  from the Step 5 note and restart.
 
 ---
 
-## Step 6: Send synthetic telemetry and verify ingestion
+## Step 7: Send rich synthetic telemetry and verify ingestion
 
-Use `telemetrygen` (the OpenTelemetry Collector Contrib generator). Run it from its **Docker
-image** on the same network, so nothing is installed on the host. Its `--duration` flag
-terminates the run reliably, so no watchdog wrapper is needed.
+Use `telemetrygen` (the OpenTelemetry Collector Contrib generator) from its **Docker image**, so
+nothing is installed on the host.
Instead of one flat burst, send telemetry across several +**services**, **severities**, **span statuses**, and **metric types**, so ClickStack's Search, +Service Map, and dashboards have realistic, varied data rather than a single uniform stream. -Load the env file so the token is available as a variable, then reference `$OTLP_AUTH_TOKEN` -so the literal token never appears in the command text, your output, or shell history. -`telemetrygen`'s header syntax requires the value to be a quoted string: `key="value"`. +Load the env file so the token is available, then reference `$OTLP_AUTH_TOKEN` so the literal token +never appears in the command text, your output, or shell history. `telemetrygen`'s header syntax +requires the value to be a quoted string: `key="value"`. ```bash -set -a; . "$ENV_FILE"; set +a +[ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a TG_IMAGE=ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest +NET=clickstack-net +ENDPOINT=clickstack-otel-collector:4317 + tg() { # usage: tg [extra telemetrygen flags...] local signal="$1"; shift - docker run --rm --network clickstack-net "$TG_IMAGE" "$signal" \ - --otlp-endpoint clickstack-otel-collector:4317 \ + docker run --rm --network "$NET" "$TG_IMAGE" "$signal" \ + --otlp-endpoint "$ENDPOINT" \ --otlp-insecure \ --otlp-header "authorization=\"$OTLP_AUTH_TOKEN\"" \ - --rate 5 --duration 10s "$@" + --rate 10 --duration 15s "$@" } +``` + +> **Existing-collector path:** set `NET` and `ENDPOINT` to reach *your* collector instead. If it +> runs on this host, use `--network host` style access or point `ENDPOINT` at its published +> address, and set the `authorization` header (or other auth) to whatever your receiver expects. +> Everything below is otherwise identical. 
+
+**Reading `telemetrygen` output: it is noisy and ends with a scary-looking but normal line.** Each
+run prints verbose gRPC logs and then terminates with `rpc error: code = Canceled desc = grpc: the
+client connection is closing` once `--duration` elapses. **That trailing line is expected shutdown,
+not a failure.** The real failure signals are a **non-zero exit code** and an auth error such as
+`code = Unauthenticated desc = provided authorization does not match expected scheme or token`. Do
+not treat a clean, full-duration run as broken just because of the closing-connection line; judge
+success by the row counts in the verification queries below, not by the generator's log volume.
+
+**Quote attribute values so the inner double quotes survive the shell.** `telemetrygen` requires
+each attribute as `key="value"` (with literal double quotes), and rejects a bare `key=value` with
+`value should be a string wrapped in double quotes`. If you write `--otlp-attributes
+deployment.environment="production"`, bash strips the quotes and the container receives
+`deployment.environment=production`, which hard-fails. Wrap the **whole argument in single quotes**
+so the inner double quotes reach the container. The `tg` helper gets the same result for the auth
+header by backslash-escaping the inner quotes inside a double-quoted string, because it needs
+`$OTLP_AUTH_TOKEN` to expand.
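
The quoting behaviour described above can be seen without running `telemetrygen` at all. This sketch uses a hypothetical `show_args` helper (not part of the skill) that prints each argument exactly as the shell hands it to a program, and `ENVIRONMENT` is an illustrative variable name.

```bash
# Hypothetical helper: print every argument on its own line, wrapped in
# brackets, so stripped or preserved quotes are easy to see.
show_args() { printf '[%s]\n' "$@"; }

# Unquoted: the shell removes the double quotes before the program sees them.
show_args deployment.environment="production"

# Single-quoted: the inner double quotes are passed through literally.
show_args 'deployment.environment="production"'

# Backslash-escaped inside double quotes: quotes are kept and variables expand.
ENVIRONMENT=production
show_args "deployment.environment=\"$ENVIRONMENT\""
```

The first call prints `[deployment.environment=production]`, the form `telemetrygen` rejects, while the other two print `[deployment.environment="production"]`.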
+ +Logs across two services with different severities and bodies, including an error line: + +```bash +tg logs --service checkout --severity-text Info --severity-number 9 \ + --body "checkout completed" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.method="POST"' +tg logs --service payment --severity-text Error --severity-number 17 \ + --body "payment gateway timeout" \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.status_code="500"' +``` + +Traces with child spans, a healthy service and an erroring one (this is what populates the Service +Map and the error views): + +```bash +tg traces --service checkout --child-spans 4 --span-duration 120ms --status-code Ok \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/cart"' +tg traces --service payment --child-spans 3 --span-duration 400ms --status-code Error \ + --otlp-attributes 'deployment.environment="production"' \ + --telemetry-attributes 'http.route="/charge"' +``` + +Metrics across the three common types, so dashboards have gauges, counters, and a distribution: -tg logs -tg traces -tg metrics --metric-type Sum +```bash +tg metrics --service checkout --metric-type Sum +tg metrics --service checkout --metric-type Gauge +tg metrics --service payment --metric-type Histogram ``` -(`--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`. Use `--otlp-http` -with `--otlp-endpoint clickstack-otel-collector:4318` if you want to exercise the HTTP path -instead of gRPC.) +(`--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`. Add `--otlp-http` +with `--otlp-endpoint clickstack-otel-collector:4318` to exercise the HTTP path instead of gRPC.) 
Wait ~15 seconds for the collector to flush its batch, then confirm the tables exist: @@ -405,63 +696,76 @@ clickhousectl cloud service query --id "$SERVICE_ID" --query \ GROUP BY table ORDER BY table" ``` -You should see non-zero `rows` for `otel_logs`, `otel_traces`, and `otel_metrics_sum`. If a -signal is missing: +You should see non-zero `rows` for `otel_logs`, `otel_traces`, `otel_metrics_sum`, +`otel_metrics_gauge`, and `otel_metrics_histogram`. If a signal is missing: 1. Tail the collector logs (`docker logs --tail 50 clickstack-otel-collector`) for export errors. -2. Confirm the `authorization` header matches `$OTLP_AUTH_TOKEN` (a mismatch shows in the - generator output as `code = Unauthenticated desc = provided authorization does not match - expected scheme or token`). +2. Confirm the `authorization` header matches `$OTLP_AUTH_TOKEN` (a mismatch shows in the generator + output as `code = Unauthenticated desc = provided authorization does not match expected scheme + or token`). 3. Re-check `CLICKHOUSE_ENDPOINT` has the `https://` scheme and `:8443` port. -4. Some metric kinds flush slowly. Re-run the count after another 30 seconds before declaring failure. +4. Some metric kinds flush slowly. Re-run the count after another 30 seconds before declaring + failure. Do not proceed until every expected signal has non-zero rows. --- -## Step 7: Own the last mile in ClickStack (wake the service, then complete onboarding) +## Step 8: Confirm the service is awake, then complete onboarding in ClickStack + +Rows in ClickHouse are **not** the same as the user seeing telemetry in ClickStack. The ClickStack +UI requires a one-time onboarding step that auto-detects the data sources, and that step fails if +the ClickHouse service has idle-suspended in the meantime. -Rows in ClickHouse are **not** the same as the user seeing telemetry in ClickStack. 
The
-ClickStack UI requires a one-time onboarding step that detects the data sources, and that step
-fails if the ClickHouse service has idle-suspended in the meantime. So immediately before
-sending the user to the console, **wake the service and keep it awake**:
+**First, confirm the service is awake.** Do not skip this; it is the most common reason onboarding
+shows no sources. Run a real query and require it to succeed:

 ```bash
 clickhousectl cloud service query --id "$SERVICE_ID" --query "SELECT 1"
 ```

-Then tell the user to finish the onboarding in the ClickHouse Cloud console. Do not just say
-"done", walk them through it:
+If this returns `1`, the service is awake; continue immediately to the console steps below while it
+stays warm. If it errors or times out, the service was asleep and this call is waking it: wait a
+few seconds and re-run until it returns `1`. Only proceed once it succeeds.

-1. Go back to the [ClickHouse Cloud console](https://console.clickhouse.cloud) and open the
-   target service.
-2. Select **ClickStack** from the left-hand menu.
-3. Open **Getting Started** in the left-hand menu and complete the onboarding flow to detect
-   sources.
-4. The sources are **auto-detected**, and logs and traces become available in ClickStack.
+**Then walk the user through onboarding explicitly.** Do not just say "done"; spell out each click,
+because the sources only appear after this flow is completed:

-The direct link is `https://console.clickhouse.cloud/services//clickstack`.
+1. Go to the [ClickHouse Cloud console](https://console.clickhouse.cloud) and open the target
+   service.
+2. In the **left-hand menu, select ClickStack**.
+3. Click through to **Getting Started** and follow the onboarding flow.
+4. **Ignore any prompt that asks you to set up or configure a collector / start ingestion.** You
+   have already done that in the steps above. Skip straight past those screens (click through /
+   "Next") to source detection. 
Re-running the console's collector setup is unnecessary and only
+   causes confusion.
+5. The data sources are **auto-detected**: logs, traces, and metrics for the `otel` database are
+   picked up automatically, and your data appears in the Search and dashboard views.

-**If source detection fails**, the service almost certainly idle-suspended between the data
-send and the console step. Re-run the wake query above for the user, then have them re-run the
-detection, rather than leaving them to debug an opaque failure.
+The direct link is `https://console.clickhouse.cloud/services//clickstack` (substitute
+`$SERVICE_ID`).
+
+**If source detection shows nothing**, the service almost certainly idle-suspended between the data
+send and the console step. Re-run the `SELECT 1` wake query above for the user, then have them
+re-run the detection, rather than leaving them to debug an opaque failure.

 ---

-## Step 8: Summarize and hand off (without echoing secrets)
+## Step 9: Summarize and hand off (without echoing secrets)

-Print a summary in exactly this format. Note the token is **referenced, not printed**, the
-SQL password is not shown at all, and the collector keeps running until the user stops it:
+Print a summary in roughly this format. Note the token is **referenced, not printed**, the SQL
+password is not shown at all, and the collector keeps running until the user stops it. Adjust the
+"how to stop" line to the deploy mode they chose.

 ```
-✅ ClickStack is set up and ingesting telemetry for service (),
-   and the data sources are detected in ClickStack.
+✅ ClickStack is set up and ingesting telemetry for service ().
+   Complete onboarding in the console (Step 8) to auto-detect the sources and see your data.
-Local OpenTelemetry collector
-  ▸ Container: clickstack-otel-collector (Docker, network clickstack-net)
+Collector
+  ▸ New local collector via (or: configured your existing collector)
   ▸ Send OTLP gRPC to: localhost:4317
   ▸ Send OTLP HTTP to: localhost:4318
-  ▸ Health check: http://localhost:13133/
+  ▸ Health check: http://localhost:13133/ (local collector only)
   ▸ Required header: authorization:
     (retrieve with: grep OTLP_AUTH_TOKEN /collector.env)

@@ -478,10 +782,12 @@ Finish in the ClickHouse Cloud console:

 Then tell the user, in your own words, that:

-1. All secrets live in `/collector.env` (mode `0600`). Nothing sensitive was pasted
-   into this chat or passed on a `docker run` command line.
-2. The collector keeps running until they stop it: `docker stop clickstack-otel-collector`
-   shuts it down, `docker start clickstack-otel-collector` brings it back.
+1. All secrets live in `/collector.env` (mode `0600`). Nothing sensitive was pasted into
+   this chat or passed on a `docker run` command line.
+2. The collector keeps running until they stop it. For Compose:
+   `cd && docker compose down` stops it, `docker compose up -d` brings it back. For
+   individual Docker: `docker stop clickstack-otel-collector` and `docker start
+   clickstack-otel-collector`.
 3. Any application, SDK, or agent on this host can now send OTLP to `localhost:4317` (gRPC) or
    `localhost:4318` (HTTP) with the `authorization` header from the env file.
 4. The ClickHouse Cloud service idle-suspends. 
If ClickStack later shows no recent data, the
@@ -492,11 +798,18 @@ Then tell the user, in your own words, that:
 ## Cleanup (only if the user explicitly asks)

 ```bash
+# Docker Compose deployment:
+( cd "$WORKDIR" && docker compose down )
+
+# Individual Docker deployment:
 docker rm -f clickstack-otel-collector
 docker network rm clickstack-net 2>/dev/null || true
+
+# Either deployment, remove the ingest user:
 clickhousectl cloud service query --id "$SERVICE_ID" --query "DROP USER IF EXISTS hyperdx_ingest"
-# Optionally remove the local secrets once they are no longer needed:
-# rm -f "$WORKDIR/collector.env" "$WORKDIR/svc.json"
+
+# Optionally remove the local files once they are no longer needed:
+# rm -f "$WORKDIR/collector.env" "$WORKDIR/svc.json" "$WORKDIR/docker-compose.yaml"
 ```

 Do **not** drop the `otel` database: it contains telemetry the user may want to retain.
From a4d6f9255937bf15f1fd4274a2ca735f6bf1fbb4 Mon Sep 17 00:00:00 2001
From: Dale McDiarmid
Date: Mon, 15 Jun 2026 14:09:26 +0100
Subject: [PATCH 3/4] more refinement to skill

---
 .../skills/clickstack-otel-collector/SKILL.md | 131 ++++++++++++------
 1 file changed, 85 insertions(+), 46 deletions(-)

diff --git a/static/skills/clickstack-otel-collector/SKILL.md b/static/skills/clickstack-otel-collector/SKILL.md
index c6fdf71a522..1cd29833d78 100644
--- a/static/skills/clickstack-otel-collector/SKILL.md
+++ b/static/skills/clickstack-otel-collector/SKILL.md
@@ -4,7 +4,7 @@ description: Use when a user wants to wire an OpenTelemetry collector into a Man
 license: Apache-2.0
 metadata:
   author: ClickHouse Inc
-  version: "0.5.0"
+  version: "0.6.0"
 ---

 # Set up an OpenTelemetry collector for Managed ClickStack
@@ -160,12 +160,20 @@ your receiver is your own setup); it is generated only so the same file works if
 to the local collector. The `CLICKHOUSE_*` values are still used: they go into the exporter
 config you add to your collector in Step 6.
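Since the `CLICKHOUSE_*` values and the OTLP token are read from the env file rather than typed, it can help to confirm the expected keys are present without ever printing a value. The following is a minimal sketch, not part of the patch: `env_missing_keys` is a hypothetical helper name, and the key names are the ones this skill refers to elsewhere (`CLICKHOUSE_ENDPOINT`, `CLICKHOUSE_PASSWORD`, `OTLP_AUTH_TOKEN`).

```bash
# Hypothetical helper (an assumption, not defined by the skill): list the keys that are
# missing from an env file. Values are never printed, only key names.
env_missing_keys() {
  file="$1"; shift
  missing=""
  for key in "$@"; do
    # grep -q produces no output, so a secret value cannot reach the terminal.
    grep -q "^${key}=" "$file" || missing="$missing $key"
  done
  # Drop the leading space and print the names (an empty line when nothing is missing).
  printf '%s\n' "${missing# }"
  # Succeed only when every requested key was found.
  [ -z "$missing" ]
}
```

A typical call would be `env_missing_keys "$ENV_FILE" CLICKHOUSE_ENDPOINT CLICKHOUSE_PASSWORD OTLP_AUTH_TOKEN`, which prints nothing but an empty line and returns success when the file is complete.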

-From now on, load the file when you need a value instead of typing secrets:
+**Every later step runs in a fresh shell, so `WORKDIR`, `ENV_FILE`, and any exported credentials do
+not persist, and `WORKDIR`/`ENV_FILE` are not stored inside the env file, so sourcing it can't
+recover them.** Begin each subsequent step's shell with this **standard preamble**, which
+re-derives the paths from the deterministic default, loads the saved credentials (Step 3), and
+loads the config:

 ```bash
-set -a; . "$ENV_FILE"; set +a
+WORKDIR="${WORKDIR:-$HOME/clickstack-otel-collector}"; ENV_FILE="$WORKDIR/collector.env"
+[ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a
 ```

+If you chose a non-default `WORKDIR`, set it explicitly at the top of every step (the `${WORKDIR:-…}`
+default only covers the standard location). Later steps refer to this as "the standard preamble".
+
 **Confirm with the user** that `SERVICE_REF` is correct. Tell them the working directory and that
 `collector.env` (mode `0600`) now holds the OTLP token and the SQL password. Do **not** print
 either secret. If they want to see a value, point them at the file
@@ -265,9 +273,14 @@ succeed without `creds.env`, you can skip this; but most agent shells need it.)

 ## Step 4: Resolve the service and capture the HTTPS endpoint

-Open the shell with the combined load so credentials and config are both present
-(`[ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a`), then
-resolve the service. If `SERVICE_REF` is a UUID, use it directly; otherwise look it up by name:
+Run the standard preamble (Step 2) so the paths, credentials, and config are all loaded in this
+shell, then resolve the service. If `SERVICE_REF` is a UUID, use it directly; otherwise look it up
+by name:
+
+```bash
+WORKDIR="${WORKDIR:-$HOME/clickstack-otel-collector}"; ENV_FILE="$WORKDIR/collector.env"
+[ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . 
"$ENV_FILE"; set +a
+```

 ```bash
 # UUID form
@@ -314,9 +327,8 @@ service ''...`, which is expected.
 ## Step 5: Create the `hyperdx_ingest` SQL user and grant it `otel.*`

 This step is the same on both paths: the collector (new or existing) authenticates to ClickHouse
-as `hyperdx_ingest`. The user name is fixed and the password charset (Step 2) needs no escaping, so
-single-quoting it in SQL is safe. Open the shell with the combined load so `$CLICKHOUSE_PASSWORD`
-(and credentials) are set.
+as `hyperdx_ingest`. Open the shell with the combined load so `$CLICKHOUSE_PASSWORD` (and
+credentials) are set.

 > **Expect an approval prompt here.** The `CREATE USER` / `GRANT` statements below are DDL against
 > a Cloud service, so some agent sandboxes flag them as "modifying shared production
@@ -324,27 +336,36 @@ single-quoting it in SQL is safe. Open the shell with the combined load so `$CLI
 > scoped to a single dedicated ingest user and the `otel` schema, and the user should approve them
 > explicitly when prompted.

-**Pass the password-bearing statements over stdin, never as `--query`.** Interpolating
-`$CLICKHOUSE_PASSWORD` into `--query "… BY '…'"` puts the secret in the process arg list (visible in
-`ps`) and shell history, the very thing the skill avoids by using `docker --env-file` over `-e`. It
-also trips auto-mode classifiers, which deny the call citing "the secret expanded on the command
-line." Feed the DDL through `--queries-file -` (stdin) with a heredoc instead, so the password
-reaches ClickHouse but never an argument:
+**Never put the plaintext password in the SQL. 
Hash it locally and use `sha256_hash`.** Two
+problems rule out `IDENTIFIED WITH sha256_password BY '$CLICKHOUSE_PASSWORD'`: the secret would
+land in the process arg list (visible in `ps`) and shell history, and, critically, **the Query API
+echoes the failing statement verbatim in its error JSON**, so any error (a transient failure, a
+charset slip) leaks the password into output an agent may surface. Passing it over stdin does not
+help, the error echo still contains it. Instead compute the SHA-256 hash of the password locally
+(`sha256_hash` stores exactly what `sha256_password` would, so the collector still logs in with the
+plaintext from the env file) and put only the **hash** in the statement. A hash is non-reversible,
+so even an echoed error cannot leak the password:

 ```bash
+WORKDIR="${WORKDIR:-$HOME/clickstack-otel-collector}"; ENV_FILE="$WORKDIR/collector.env"
 [ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a
-# CREATE then ALTER so a re-run forces the password to this run's value. The heredoc body is
-# stdin, so $CLICKHOUSE_PASSWORD is never on the command line or in `ps`.
-clickhousectl cloud service query --id "$SERVICE_ID" --queries-file - < /dev/null
 ```
@@ -459,16 +484,22 @@ Add the ClickHouse exporter to your existing collector configuration. The config
 behavior of the ClickStack distribution, including the Session Replay (`rrweb`) routing path, and
 writes into the `otel` database the ClickStack UI expects.

-Print the two values you need to substitute (do not paste the password into chat; read it from the
-file on your own machine):
+**Reference the endpoint and password as environment variables (`${env:…}`), do not hardcode them
+into the config file.** The contrib collector expands `${env:VAR}` at load time, so keeping the
+plaintext password out of the config file is both safer and consistent with the rest of this skill.
+Start your collector with the env vars available, the simplest way is the same `--env-file` the
+local collector uses:

 ```bash
-grep -E '^CLICKHOUSE_ENDPOINT=' "$ENV_FILE"
-# password lives here too: grep CLICKHOUSE_PASSWORD "$ENV_FILE"
+# When running the contrib collector in Docker, pass collector.env so ${env:CLICKHOUSE_*} resolve:
+#   docker run -d --env-file "$ENV_FILE" -p 4317:4317 -p 4318:4318 \
+#     -v "$WORKDIR/your-config.yaml:/etc/otelcol-contrib/config.yaml:ro" \
+#     otel/opentelemetry-collector-contrib:latest
+# For a non-Docker collector, export CLICKHOUSE_ENDPOINT and CLICKHOUSE_PASSWORD into its
+# environment (e.g. an EnvironmentFile= in the systemd unit) before it starts.
 ```

-Substitute `` and `` with those values, add this to
-your collector config, and reload it:
+Add this to your collector config and reload it:

 ```yaml
 receivers:
@@ -503,9 +534,9 @@ connectors:
 exporters:
   clickhouse:
     database: otel
-    endpoint:
+    endpoint: ${env:CLICKHOUSE_ENDPOINT}
     username: hyperdx_ingest
-    password:
+    password: ${env:CLICKHOUSE_PASSWORD}
     ttl: 720h
     timeout: 5s
     retry_on_failure:
@@ -515,9 +546,9 @@ exporters:
       max_elapsed_time: 300s
   clickhouse/rrweb:
     database: otel
-    endpoint:
+    endpoint: ${env:CLICKHOUSE_ENDPOINT}
     username: hyperdx_ingest
-    password:
+    password: ${env:CLICKHOUSE_PASSWORD}
     ttl: 720h
     logs_table_name: hyperdx_sessions
     timeout: 5s
@@ -595,25 +626,31 @@ nothing is installed on the host. Instead of one flat burst, send telemetry acro
 **services**, **severities**, **span statuses**, and **metric types**, so ClickStack's Search, Service
 Map, and dashboards have realistic, varied data rather than a single uniform stream.

-Load the env file so the token is available, then reference `$OTLP_AUTH_TOKEN` so the literal token
-never appears in the command text, your output, or shell history. `telemetrygen`'s header syntax
-requires the value to be a quoted string: `key="value"`.
+Load the env file so the token is available, then reference `$OTLP_AUTH_TOKEN`. 
The `tg` helper
+below **redirects all generator output to a log file** and prints only an exit code, because
+`telemetrygen` echoes its full config, **including the `authorization` header (your OTLP token)**,
+to stdout. Never surface that raw output in the chat. `telemetrygen`'s header syntax requires the
+value to be a quoted string: `key="value"`.

 ```bash
+WORKDIR="${WORKDIR:-$HOME/clickstack-otel-collector}"; ENV_FILE="$WORKDIR/collector.env"
 [ -f "$WORKDIR/creds.env" ] && . "$WORKDIR/creds.env"; set -a; . "$ENV_FILE"; set +a

 TG_IMAGE=ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest
 NET=clickstack-net
 ENDPOINT=clickstack-otel-collector:4317
+TG_LOG="$WORKDIR/telemetrygen.log"; : > "$TG_LOG"

 tg() {
   # usage: tg [extra telemetrygen flags...]
+  # Output (which contains the token in the echoed config) goes to $TG_LOG, never the terminal.
   local signal="$1"; shift
   docker run --rm --network "$NET" "$TG_IMAGE" "$signal" \
     --otlp-endpoint "$ENDPOINT" \
     --otlp-insecure \
     --otlp-header "authorization=\"$OTLP_AUTH_TOKEN\"" \
-    --rate 10 --duration 15s "$@"
+    --rate 10 --duration 15s "$@" >>"$TG_LOG" 2>&1
+  echo "$signal exit=$?"
 }
 ```

@@ -622,13 +659,14 @@ tg() {
 > address, and set the `authorization` header (or other auth) to whatever your receiver expects.
 > Everything below is otherwise identical.

-**Reading `telemetrygen` output: it is noisy and ends with a scary-looking but normal line.** Each
-run prints verbose gRPC logs and then terminates with `rpc error: code = Canceled desc = grpc: the
-client connection is closing` once `--duration` elapses. **That trailing line is expected shutdown,
-not a failure.** The real failure signals are a **non-zero exit code** and an auth error such as
-`code = Unauthenticated desc = provided authorization does not match expected scheme or token`. 
Do
-not treat a clean, full-duration run as broken just because of the closing-connection line; judge
-success by the row counts in the verification queries below, not by the generator's log volume.
+**Judge success by exit code and row counts, never by the generator's logs.** Two reasons. First,
+the log contains your OTLP token (see above), so do not print it. Second, it is noisy and every run
+ends with `rpc error: code = Canceled desc = grpc: the client connection is closing` once
+`--duration` elapses, which is **expected shutdown, not a failure**. The `tg` helper already prints
+` exit=0` on success. If you must inspect a failure, grep the log for the real signal
+without dumping it, for example `grep -c Unauthenticated "$TG_LOG"` (a non-zero count plus a
+non-zero exit means the `authorization` header did not match). Confirm overall success with the row
+counts in the verification queries below.

 **Quote attribute values so the inner double quotes survive the shell.** `telemetrygen` requires
 each attribute as `key="value"` (with literal double quotes), and rejects a bare `key=value` with
@@ -700,9 +738,10 @@ You should see non-zero `rows` for `otel_logs`, `otel_traces`, `otel_metrics_sum
 `otel_metrics_gauge`, and `otel_metrics_histogram`. If a signal is missing:

 1. Tail the collector logs (`docker logs --tail 50 clickstack-otel-collector`) for export errors.
-2. Confirm the `authorization` header matches `$OTLP_AUTH_TOKEN` (a mismatch shows in the generator
-   output as `code = Unauthenticated desc = provided authorization does not match expected scheme
-   or token`).
+2. Confirm the `authorization` header matches `$OTLP_AUTH_TOKEN`: `grep -c Unauthenticated
+   "$TG_LOG"` (a non-zero count means a mismatch, the full message is `code = Unauthenticated desc =
+   provided authorization does not match expected scheme or token`). Grep rather than print the
+   log, since it contains the token.
 3. Re-check `CLICKHOUSE_ENDPOINT` has the `https://` scheme and `:8443` port.
 4. Some metric kinds flush slowly. Re-run the count after another 30 seconds before declaring
    failure.
From c4f6b0667e41cc939ef9305c7e25eb5fce585f2b Mon Sep 17 00:00:00 2001
From: Dale McDiarmid
Date: Mon, 15 Jun 2026 15:27:10 +0100
Subject: [PATCH 4/4] more docs refinement

---
 .../example-datasets/telemetrygen.md | 150 +++++++-----------
 .../clickstack/ingesting-data/sdks/index.md | 2 +-
 .../_snippets/_confirm_in_ui.md | 6 +-
 ...setting-up-your-opentelemetry-collector.md | 150 +++++++++++++-----
 4 files changed, 178 insertions(+), 130 deletions(-)

diff --git a/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md b/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md
index b55913e395e..e0f5bd2611f 100644
--- a/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md
+++ b/docs/use-cases/observability/clickstack/example-datasets/telemetrygen.md
@@ -55,27 +55,30 @@ export OTLP_AUTH_TOKEN=
 ```

 :::note[Unsecured collector]
-The ClickStack OpenTelemetry collector is unauthenticated by default. If you haven't followed [Securing the collector](/use-cases/observability/clickstack/ingesting-data/otel-collector#securing-the-collector) to set an `OTLP_AUTH_TOKEN`, drop the `--otlp-header` flag from the commands below.
+The ClickStack OpenTelemetry collector is unauthenticated by default. If you haven't followed [Securing the collector](/use-cases/observability/clickstack/ingesting-data/otel-collector#securing-the-collector) to set an `OTLP_AUTH_TOKEN`, drop the `--otlp-header` line from the helper below.
 :::

-### Generate logs {#generate-logs-managed}
-
-Send logs from two services with different severities, bodies, and attributes, so the `Search` view has both informational and error events to filter on:
+Define a small `tg` helper so each command only specifies what varies (service, severity, status, attributes):

 ```shell
-telemetrygen logs \
+tg() { local signal=$1; shift; telemetrygen "$signal" \
   --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
   --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --rate 5 --duration 30s \
-  --severity-text Info --severity-number 9 --body "checkout completed" \
+  --rate 5 --duration 30s "$@"; }
+```
+
+### Generate logs {#generate-logs-managed}
+
+Send logs as a realistic mix of severities across services, mostly informational with a warning and an error rather than one uniform stream:
+
+```shell
+tg logs --service frontend --severity-text Info --severity-number 9 --body "GET /api/products 200" \
+  --otlp-attributes 'deployment.environment="production"' \
+  --telemetry-attributes 'http.method="GET"' --telemetry-attributes 'http.status_code="200"'
+tg logs --service checkout --severity-text Warn --severity-number 13 --body "retrying payment authorization" \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.method="POST"'
-
-telemetrygen logs \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service payment --rate 5 --duration 30s \
-  --severity-text Error --severity-number 17 --body "payment gateway timeout" \
+tg logs --service payment --severity-text Error --severity-number 17 --body "payment gateway timeout" \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.status_code="500"'
 ```
@@ -90,22 +93,18 @@ The most useful log flags:

 ### Generate traces {#generate-traces-managed}

-Send multi-span traces, one healthy service and one returning errors. 
The child spans and error status populate the Service Map and the error views:
+Send multi-span traces from several healthy services plus one failing dependency. This gives the Service Map a realistic shape, mostly healthy with one erroring service, and populates the error views:

 ```shell
-telemetrygen traces \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --rate 5 --duration 30s --workers 2 \
-  --child-spans 4 --span-duration 120ms --span-links 1 --status-code Ok \
-  --otlp-attributes 'deployment.environment="production"' \
-  --telemetry-attributes 'http.route="/cart"'
-
-telemetrygen traces \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service payment --rate 5 --duration 30s \
-  --child-spans 3 --span-duration 400ms --status-code Error \
+# Healthy services: the bulk of the traffic, all spans Ok
+for svc in frontend checkout cart; do
+  tg traces --service "$svc" --child-spans 3 --span-duration 80ms --status-code Ok \
+    --otlp-attributes 'deployment.environment="production"' \
+    --telemetry-attributes "http.route=\"/$svc\""
+done
+
+# One slow dependency returning errors
+tg traces --service payment --child-spans 3 --span-duration 450ms --span-links 1 --status-code Error \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.route="/charge"'
 ```
@@ -123,30 +122,16 @@ The most useful trace flags:
 Send the three common metric types so dashboards have counters, gauges, and a distribution. 
Unlike some generators, `telemetrygen` honors `--duration` for metrics, so no manual stop is needed:

 ```shell
-telemetrygen metrics --metric-type Sum \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --otlp-metric-name http.server.requests \
-  --aggregation-temporality cumulative --rate 5 --duration 30s
-
-telemetrygen metrics --metric-type Gauge \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --otlp-metric-name system.memory.usage \
-  --rate 5 --duration 30s
-
-telemetrygen metrics --metric-type Histogram \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service payment --otlp-metric-name http.server.duration \
-  --rate 5 --duration 30s
+tg metrics --service frontend --metric-type Sum --otlp-metric-name http.server.requests --aggregation-temporality cumulative
+tg metrics --service frontend --metric-type Gauge --otlp-metric-name system.memory.usage
+tg metrics --service payment --metric-type Histogram --otlp-metric-name http.server.duration
 ```

 `--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`. `--otlp-metric-name` names the series so you can find it in the UI, and `--aggregation-temporality` is `delta` or `cumulative`.

 ### Verify in ClickStack {#verify-managed}

-Open the ClickStack UI from the ClickHouse Cloud console. In the `Search` view, set the time range to `Last 15 minutes` and switch the source between `Logs` and `Traces`. Filter on `ServiceName` to see the `checkout` and `payment` services, and on `SeverityText` to find the `Error` log line. Open a `payment` trace to see the child spans and the error status. Open the `Chart Explorer`, select `Metrics`, and chart one of the metric names you set above (for example `http.server.requests`) to verify metrics ingestion.
+Open the ClickStack UI from the ClickHouse Cloud console. In the `Search` view, set the time range to `Last 15 minutes` and switch the source between `Logs` and `Traces`. Filter on `ServiceName` to see the `frontend`, `checkout`, `cart`, and `payment` services, and on `SeverityText` to find the warning and error log lines. Open a `payment` trace to see the child spans and the error status. Open the `Chart Explorer`, select `Metrics`, and chart one of the metric names you set above (for example `http.server.requests`) to verify metrics ingestion.

@@ -186,79 +171,64 @@ Export the ingestion API key:
 export CLICKSTACK_API_KEY=
 ```

-### Generate logs {#generate-logs-oss}
-
-Send logs from two services with different severities, bodies, and attributes:
+Define a small `tg` helper so each command only specifies what varies (service, severity, status, attributes):

 ```shell
-telemetrygen logs \
+tg() { local signal=$1; shift; telemetrygen "$signal" \
   --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
   --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service checkout --rate 5 --duration 30s \
-  --severity-text Info --severity-number 9 --body "checkout completed" \
+  --rate 5 --duration 30s "$@"; }
+```
+
+### Generate logs {#generate-logs-oss}
+
+Send logs as a realistic mix of severities across services, mostly informational with a warning and an error rather than one uniform stream:
+
+```shell
+tg logs --service frontend --severity-text Info --severity-number 9 --body "GET /api/products 200" \
+  --otlp-attributes 'deployment.environment="production"' \
+  --telemetry-attributes 'http.method="GET"' --telemetry-attributes 'http.status_code="200"'
+tg logs --service checkout --severity-text Warn --severity-number 13 --body "retrying payment authorization" \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.method="POST"'
-
-telemetrygen logs \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header 
"authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service payment --rate 5 --duration 30s \
-  --severity-text Error --severity-number 17 --body "payment gateway timeout" \
+tg logs --service payment --severity-text Error --severity-number 17 --body "payment gateway timeout" \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.status_code="500"'
 ```

 ### Generate traces {#generate-traces-oss}

-Send multi-span traces, one healthy service and one returning errors:
+Send multi-span traces from several healthy services plus one failing dependency. This gives the Service Map a realistic shape, mostly healthy with one erroring service, and populates the error views:

 ```shell
-telemetrygen traces \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service checkout --rate 5 --duration 30s --workers 2 \
-  --child-spans 4 --span-duration 120ms --span-links 1 --status-code Ok \
-  --otlp-attributes 'deployment.environment="production"' \
-  --telemetry-attributes 'http.route="/cart"'
-
-telemetrygen traces \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service payment --rate 5 --duration 30s \
-  --child-spans 3 --span-duration 400ms --status-code Error \
+# Healthy services: the bulk of the traffic, all spans Ok
+for svc in frontend checkout cart; do
+  tg traces --service "$svc" --child-spans 3 --span-duration 80ms --status-code Ok \
+    --otlp-attributes 'deployment.environment="production"' \
+    --telemetry-attributes "http.route=\"/$svc\""
+done
+
+# One slow dependency returning errors
+tg traces --service payment --child-spans 3 --span-duration 450ms --span-links 1 --status-code Error \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.route="/charge"'
 ```

 ### Generate metrics {#generate-metrics-oss}

-Send the three common metric types:
+Send the three common metric types 
so charts have a counter, a gauge, and a distribution:

 ```shell
-telemetrygen metrics --metric-type Sum \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service checkout --otlp-metric-name http.server.requests \
-  --aggregation-temporality cumulative --rate 5 --duration 30s
-
-telemetrygen metrics --metric-type Gauge \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service checkout --otlp-metric-name system.memory.usage \
-  --rate 5 --duration 30s
-
-telemetrygen metrics --metric-type Histogram \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${CLICKSTACK_API_KEY}\"" \
-  --service payment --otlp-metric-name http.server.duration \
-  --rate 5 --duration 30s
+tg metrics --service frontend --metric-type Sum --otlp-metric-name http.server.requests --aggregation-temporality cumulative
+tg metrics --service frontend --metric-type Gauge --otlp-metric-name system.memory.usage
+tg metrics --service payment --metric-type Histogram --otlp-metric-name http.server.duration
 ```

 `--metric-type` accepts `Gauge`, `Sum`, `Histogram`, or `ExponentialHistogram`.

 ### Verify in ClickStack {#verify-oss}

-Visit [http://localhost:8080](http://localhost:8080) to open the ClickStack UI. 
In the `Search` view, set the time range to `Last 15 minutes` and switch the source between `Logs` and `Traces`. Filter on `ServiceName` to see the `checkout` and `payment` services, and on `SeverityText` to find the `Error` log line. Open a `payment` trace to see the child spans and the error status. Open the `Chart Explorer`, select `Metrics`, and chart one of the metric names you set above (for example `http.server.requests`) to verify metrics ingestion.
+Visit [http://localhost:8080](http://localhost:8080) to open the ClickStack UI. In the `Search` view, set the time range to `Last 15 minutes` and switch the source between `Logs` and `Traces`. Filter on `ServiceName` to see the `frontend`, `checkout`, `cart`, and `payment` services, and on `SeverityText` to find the warning and error log lines. Open a `payment` trace to see the child spans and the error status. Open the `Chart Explorer`, select `Metrics`, and chart one of the metric names you set above (for example `http.server.requests`) to verify metrics ingestion.

diff --git a/docs/use-cases/observability/clickstack/ingesting-data/sdks/index.md b/docs/use-cases/observability/clickstack/ingesting-data/sdks/index.md
index 0d6ecdbd18e..c5796cf7105 100644
--- a/docs/use-cases/observability/clickstack/ingesting-data/sdks/index.md
+++ b/docs/use-cases/observability/clickstack/ingesting-data/sdks/index.md
@@ -46,7 +46,7 @@ While ClickStack offers its own language SDKs with enhanced telemetry and featur

 ## Securing with API key {#securing-api-key}

-:::Not required for Managed ClickStack
+:::note Not required for Managed ClickStack
 The API key isn't required for managed ClickStack.
 :::

diff --git a/docs/use-cases/observability/clickstack/managed-onboarding/_snippets/_confirm_in_ui.md b/docs/use-cases/observability/clickstack/managed-onboarding/_snippets/_confirm_in_ui.md
index 5885b9dbd0a..60a85392991 100644
--- a/docs/use-cases/observability/clickstack/managed-onboarding/_snippets/_confirm_in_ui.md
+++ b/docs/use-cases/observability/clickstack/managed-onboarding/_snippets/_confirm_in_ui.md
@@ -14,16 +14,16 @@ ClickStack will open in a new tab and you should be automatically directed to th

 ClickStack Start Ingestion

-ClickStack should automatically detect your tables and telemetry data, allowing you to proceed. Select **Start Exploring** to begin exploring your trace data.
+You'll land on the **Getting Started** page. If you're prompted to begin ingestion, proceed through the onboarding screens that guide you to start a Docker collector. 
Since you've already completed that step, you can simply click through them. ClickStack will automatically detect your telemetry data and tables, after which you can select **Start Exploring** to begin using the platform.
 ClickStack Start Exploring
-Switch the source to `Logs` and set the time range to **Last 15 minutes**. The synthetic logs from `otelgen` should appear within a few seconds.
+Switch the source to `Logs` and set the time range to **Last 15 minutes**. The synthetic logs from `telemetrygen` should appear within a few seconds.
 ClickStack Search view with logs appearing
 If nothing shows up:
-- Confirm the auth header value passed to `otelgen` matches the one your collector expects.
+- Confirm the auth header value passed to `telemetrygen` matches the one your collector expects.
 - Tail your collector's logs and look for export errors.
 - Verify the ClickHouse endpoint configured on the collector includes both the protocol and port (`https://...:8443`).
diff --git a/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md b/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md
index 9549fa8110a..2b9affa0149 100644
--- a/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md
+++ b/docs/use-cases/observability/clickstack/managed-onboarding/setting-up-your-opentelemetry-collector.md
@@ -112,41 +112,75 @@ go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemet
 export OTEL_ENDPOINT=localhost:4317
 ```
-Send logs tagged with a service, environment, and severity:
+:::note What to expect
+Each `tg` run lasts about 20 seconds (set by `--duration 20s`) and streams verbose logs the whole time, which is expected; each run returns on its own once its 20 seconds elapse. The logs below are enough to confirm the pipeline; the optional richer set adds several more runs and takes a few minutes.
+:::
+
+Define a small `tg` helper so each command only specifies what varies (service, severity, status, attributes):
 ```shell
-telemetrygen logs \
+tg() { local signal=$1; shift; telemetrygen "$signal" \
   --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
   --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --rate 5 --duration 30s \
-  --severity-text Error --severity-number 17 --body "payment gateway timeout" \
+  --rate 5 --duration 20s "$@"; }
+```
+
+Send logs as a realistic mix of severities across services, mostly informational with a warning and an error rather than one uniform error stream:
+
+```shell
+tg logs --service frontend --severity-text Info --severity-number 9 --body "GET /api/products 200" \
+  --otlp-attributes 'deployment.environment="production"' \
+  --telemetry-attributes 'http.method="GET"' --telemetry-attributes 'http.status_code="200"'
+tg logs --service checkout --severity-text Warn --severity-number 13 --body "retrying payment authorization" \
+  --otlp-attributes 'deployment.environment="production"' \
+  --telemetry-attributes 'http.method="POST"'
+tg logs --service payment --severity-text Error --severity-number 17 --body "payment gateway timeout" \
   --otlp-attributes 'deployment.environment="production"' \
   --telemetry-attributes 'http.status_code="500"'
 ```
-Send multi-span traces with child spans and an error status, which populate the Service Map and error views:
+Logs alone confirm the endpoint is working. For a richer demo dataset, with a multi-service Service Map and charts across metric types, expand and run the commands below as well. They reuse the `tg` helper, so run them in the same shell.
+
+
+Generate richer telemetry (optional)
+
+Send multi-span traces from several healthy services plus one failing dependency. This gives the Service Map a realistic shape, mostly healthy with one erroring service, and still populates the error views:
 ```shell
-telemetrygen traces \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --rate 5 --duration 30s \
-  --child-spans 4 --span-duration 120ms --span-links 1 --status-code Error \
+# Healthy services: the bulk of the traffic, all spans Ok
+for svc in frontend checkout cart; do
+  tg traces --service "$svc" --child-spans 3 --span-duration 80ms --status-code Ok \
+    --otlp-attributes 'deployment.environment="production"' \
+    --telemetry-attributes "http.route=\"/$svc\""
+done
+
+# One slow dependency returning errors
+tg traces --service payment --child-spans 3 --span-duration 450ms --span-links 1 --status-code Error \
   --otlp-attributes 'deployment.environment="production"' \
-  --telemetry-attributes 'http.route="/cart"'
+  --telemetry-attributes 'http.route="/charge"'
 ```
-Send metrics of a given type with a named series:
+Send metrics across the three common types, so charts have a counter, a gauge, and a distribution:
 ```shell
-telemetrygen metrics --metric-type Sum \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --otlp-metric-name http.server.requests \
-  --aggregation-temporality cumulative --rate 5 --duration 30s
+tg metrics --service frontend --metric-type Sum --otlp-metric-name http.server.requests --aggregation-temporality cumulative
+tg metrics --service frontend --metric-type Gauge --otlp-metric-name system.memory.usage
+tg metrics --service payment --metric-type Histogram --otlp-metric-name http.server.duration
+```
+
+For even more variety, add a couple more services and an exponential histogram:
+
+```shell
+tg traces --service catalog
--child-spans 2 --span-duration 60ms --status-code Ok \
+  --otlp-attributes 'deployment.environment="production"' --telemetry-attributes 'http.route="/catalog"'
+tg traces --service inventory --child-spans 2 --span-duration 220ms --status-code Ok \
+  --otlp-attributes 'deployment.environment="staging"' --telemetry-attributes 'http.route="/inventory"'
+tg metrics --service payment --metric-type ExponentialHistogram --otlp-metric-name db.query.duration
 ```
-For the full set of flags, variations across multiple services and metric types, and verification tips, see [Synthetic data with telemetrygen](/use-cases/observability/clickstack/getting-started/telemetrygen).
+
+
+For the full set of flags and more variations, see [Synthetic data with telemetrygen](/use-cases/observability/clickstack/getting-started/telemetrygen).
 ## Confirm in the ClickStack UI {#confirm-in-ui}
@@ -291,41 +325,75 @@ Or install the binary with Go:
 go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest
 ```
-Send logs tagged with a service, environment, and severity:
+:::note What to expect
+Each `tg` run lasts about 20 seconds (set by `--duration 20s`) and streams verbose logs the whole time, which is expected; each run returns on its own once its 20 seconds elapse. The logs below are enough to confirm the pipeline; the optional richer set adds several more runs and takes a few minutes.
+:::
+
+Define a small `tg` helper so each command only specifies what varies (service, severity, status, attributes):
 ```shell
-telemetrygen logs \
+tg() { local signal=$1; shift; telemetrygen "$signal" \
   --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
   --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --rate 5 --duration 30s \
-  --severity-text Error --severity-number 17 --body "payment gateway timeout" \
+  --rate 5 --duration 20s "$@"; }
+```
+
+Send logs as a realistic mix of severities across services, mostly informational with a warning and an error rather than one uniform error stream:
+
+```shell
+tg logs --service frontend --severity-text Info --severity-number 9 --body "GET /api/products 200" \
+  --otlp-attributes 'deployment.environment="production"' \
+  --telemetry-attributes 'http.method="GET"' --telemetry-attributes 'http.status_code="200"'
+tg logs --service checkout --severity-text Warn --severity-number 13 --body "retrying payment authorization" \
+  --otlp-attributes 'deployment.environment="production"' \
+  --telemetry-attributes 'http.method="POST"'
+tg logs --service payment --severity-text Error --severity-number 17 --body "payment gateway timeout" \
   --otlp-attributes
'deployment.environment="production"' \
   --telemetry-attributes 'http.status_code="500"'
 ```
-Send multi-span traces with child spans and an error status, which populate the Service Map and error views:
+Logs alone confirm the endpoint is working. For a richer demo dataset, with a multi-service Service Map and charts across metric types, expand and run the commands below as well. They reuse the `tg` helper, so run them in the same shell.
+
+
+Generate richer telemetry (optional)
+
+Send multi-span traces from several healthy services plus one failing dependency. This gives the Service Map a realistic shape, mostly healthy with one erroring service, and still populates the error views:
 ```shell
-telemetrygen traces \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --rate 5 --duration 30s \
-  --child-spans 4 --span-duration 120ms --span-links 1 --status-code Error \
+# Healthy services: the bulk of the traffic, all spans Ok
+for svc in frontend checkout cart; do
+  tg traces --service "$svc" --child-spans 3 --span-duration 80ms --status-code Ok \
+    --otlp-attributes 'deployment.environment="production"' \
+    --telemetry-attributes "http.route=\"/$svc\""
+done
+
+# One slow dependency returning errors
+tg traces --service payment --child-spans 3 --span-duration 450ms --span-links 1 --status-code Error \
   --otlp-attributes 'deployment.environment="production"' \
-  --telemetry-attributes 'http.route="/cart"'
+  --telemetry-attributes 'http.route="/charge"'
 ```
-Send metrics of a given type with a named series:
+Send metrics across the three common types, so charts have a counter, a gauge, and a distribution:
 ```shell
-telemetrygen metrics --metric-type Sum \
-  --otlp-endpoint ${OTEL_ENDPOINT} --otlp-insecure \
-  --otlp-header "authorization=\"${OTLP_AUTH_TOKEN}\"" \
-  --service checkout --otlp-metric-name http.server.requests \
-  --aggregation-temporality cumulative --rate 5 --duration 30s
+tg metrics --service frontend --metric-type Sum --otlp-metric-name http.server.requests --aggregation-temporality cumulative
+tg metrics --service frontend --metric-type Gauge --otlp-metric-name system.memory.usage
+tg metrics --service payment --metric-type Histogram --otlp-metric-name http.server.duration
+```
+
+For even more variety, add a couple more services and an exponential histogram:
+
+```shell
+tg traces --service catalog
--child-spans 2 --span-duration 60ms --status-code Ok \
+  --otlp-attributes 'deployment.environment="production"' --telemetry-attributes 'http.route="/catalog"'
+tg traces --service inventory --child-spans 2 --span-duration 220ms --status-code Ok \
+  --otlp-attributes 'deployment.environment="staging"' --telemetry-attributes 'http.route="/inventory"'
+tg metrics --service payment --metric-type ExponentialHistogram --otlp-metric-name db.query.duration
 ```
-For the full set of flags, variations across multiple services and metric types, and verification tips, see [Synthetic data with telemetrygen](/use-cases/observability/clickstack/getting-started/telemetrygen).
+
+
+For the full set of flags and more variations, see [Synthetic data with telemetrygen](/use-cases/observability/clickstack/getting-started/telemetrygen).
 ## Confirm in the ClickStack UI {#confirm-in-ui-existing}
@@ -337,10 +405,20 @@ For the full set of flags, variations across multiple services and metric types,
+## Next steps: send your own data {#next-steps}
+
+The synthetic burst above only proves the pipeline works. To start sending real telemetry, instrument your own services with the [ClickStack SDKs](/use-cases/observability/clickstack/sdks), which provide instrumentation for Node.js, Python, Go, Java, and other languages and export OTLP to the collector endpoint you just verified. For a complete worked example, follow [Instrument an application](/use-cases/observability/clickstack/instrument-application).
+
+To collect from infrastructure rather than application code:
+
+- [Monitoring Kubernetes](/use-cases/observability/clickstack/monitoring-kubernetes): collect logs, infrastructure metrics, and Kubernetes events from a cluster.
+- [Monitoring AWS CloudWatch logs](/use-cases/observability/clickstack/monitoring-aws-cloudwatch-logs): forward CloudWatch logs via the OpenTelemetry CloudWatch receiver.
+
 ## Further reading {#further-reading}
 This guide covers a single collector instance in its simplest form. The [OpenTelemetry collector reference](/use-cases/observability/clickstack/ingesting-data/otel-collector) covers what to do next:
+- [Deploying the collector in Kubernetes](/use-cases/observability/clickstack/ingesting-data/otel-collector#configuring-the-collector) with the upstream OpenTelemetry Helm chart and the ClickStack collector image.
 - [Securing the collector](/use-cases/observability/clickstack/ingesting-data/otel-collector#securing-the-collector) with TLS on the OTLP endpoint and least-privilege ingestion users.
 - [Processing, filtering, and enriching](/use-cases/observability/clickstack/ingesting-data/otel-collector#processing-filtering-transforming-enriching) events at the gateway.
 - [Extending the collector configuration](/use-cases/observability/clickstack/ingesting-data/otel-collector#extending-collector-config) with custom receivers, processors, and pipelines.