Bulk event ingestion (POST /track/batch) for offline-first SDKs — IoT, mobile, edge #373

@mraj602-tohands

TL;DR

Add a POST /track/batch endpoint that accepts multiple events in a single HTTP request, back-dated to their original timestamps. This unblocks any SDK that captures events while offline and syncs them later — IoT firmware, mobile apps in airplane mode, edge devices on a wake schedule, etc.

The problem in one sentence

Every event today costs one HTTP request, and a buffered event's __timestamp is honoured for the row but not for session derivation — so a 6-hour-old event ends up in whatever session is currently active, not the session it belonged to when it was captured.

Why this matters for IoT

A real example. A solar-powered environmental sensor wakes every 10 minutes, takes a reading, and tries to upload. The cell modem has 70% uptime in its deployment area, so often the upload fails and the reading is buffered locally. When connectivity returns, the device has 50–500 buffered readings to flush.

With OpenPanel today, the device options are:

  1. Send one HTTP request per buffered reading. 500 round-trips. Every cell radio wake-up burns ~3 seconds of full-power TX. A 500-event flush is ~25 minutes of radio-on time. Battery dies faster than the panel can recharge it. This is why IoT teams build their own analytics pipelines instead of using product analytics tools.
  2. Stamp everything with arrival time. Loses the when: you have a count, but not a meaningful timeline. Any cohort analysis, retention curve, or session-based metric is wrong because all events get bucketed into the moment of network reconnection rather than when they actually happened.
  3. Build an aggregator. Have the firmware compute its own counters and send a daily summary. Works, but defeats the point of using a product analytics tool — you are back to dashboards built on pre-aggregated data, can't slice by user property, can't backfill a new dimension without redeploying firmware.

The same shape applies to:

  • Mobile apps in airplane mode, subway, low-battery push deferral, OS-level network restrictions
  • Wearables that sync over Bluetooth bursts when paired with a phone
  • Kiosks that batch-sync on a wake schedule (e.g., overnight)
  • Server-side replays importing from a different system

Mixpanel solved this years ago: they accept up to 2000 events per /import call, dedupe by insert_id, and respect a 5-day historical window. PostHog has a similar /batch endpoint (no hard cap, but 1000 events per call recommended). Not having this is one of the bigger gaps when evaluating OpenPanel as a Mixpanel replacement for product teams with any flavour of offline-first behaviour.

What I'm proposing

POST /track/batch
Authorization: client-id + client-secret (same as /track)
Content-Type: application/json
Body:
{
  "events": [
    { "type": "track",    "payload": { "name": "...", "properties": { "__timestamp": "...", "__deviceId": "..." } } },
    { "type": "identify", "payload": { ... } },
    { "type": "group",    "payload": { ... } },
    { "type": "increment",  "payload": { ... } },
    ...
  ]
}

Per-request limits (Mixpanel parity):

  • Up to 2000 events per request
  • Up to 10 MB uncompressed body
  • Beyond either limit, return 400 (too many events) or 413 (body too large)
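
For SDK authors, staying under both limits comes down to chunking the local queue before upload. A minimal client-side sketch, assuming a hypothetical BatchEvent shape and a hypothetical chunkEvents helper (neither is part of OpenPanel), and assuming "10 MB" means 10 MiB of UTF-8 JSON:

```typescript
// Hypothetical event shape, loosely following the proposed request body.
type BatchEvent = { type: string; payload: Record<string, unknown> };

const MAX_EVENTS = 2000;            // proposed per-request event cap
const MAX_BYTES = 10 * 1024 * 1024; // proposed body cap, assumed to mean 10 MiB

const encoder = new TextEncoder();
const byteLength = (s: string): number => encoder.encode(s).length;

// Split a queue into batches that stay under both limits.
// The size estimate is conservative: it counts one comma per event.
function chunkEvents(
  events: BatchEvent[],
  maxEvents: number = MAX_EVENTS,
  maxBytes: number = MAX_BYTES,
): BatchEvent[][] {
  const envelope = byteLength('{"events":[]}');
  const batches: BatchEvent[][] = [];
  let current: BatchEvent[] = [];
  let currentBytes = envelope;

  for (const event of events) {
    const size = byteLength(JSON.stringify(event)) + 1;
    if (envelope + size > maxBytes) {
      throw new Error("single event exceeds the byte limit");
    }
    const full = current.length >= maxEvents || currentBytes + size > maxBytes;
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      currentBytes = envelope;
    }
    current.push(event);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each returned batch can then be serialized as { events: batch } and sent as one request. Chunking on the client keeps the 400/413 path as a safety net rather than part of the normal flow.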

Acceptance window: events with a __timestamp up to 5 days in the past are accepted; older events are rejected per-row with reason: 'validation'. Timestamps up to 1 minute in the future are tolerated; beyond that, the server clamps to wall-clock now (matches existing single-event behaviour).
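
The window rule could be sketched roughly as below. This is an illustration under assumptions, not the implementation from the PR: applyAcceptanceWindow and WindowResult are hypothetical names, an event exactly 5 days old is treated as accepted, and rejecting an unparseable __timestamp is my assumption rather than something stated above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const MAX_AGE_MS = 5 * DAY_MS;          // proposed 5-day historical window
const FUTURE_TOLERANCE_MS = 60 * 1000;  // proposed 1-minute future tolerance

type WindowResult =
  | { ok: true; timestampMs: number }
  | { ok: false; reason: "validation"; error: string };

// Hypothetical helper: decide what happens to one event's __timestamp.
function applyAcceptanceWindow(timestamp: string, nowMs: number): WindowResult {
  const ts = Date.parse(timestamp);
  if (Number.isNaN(ts)) {
    // Assumption: unparseable timestamps are rejected per-row.
    return { ok: false, reason: "validation", error: "invalid __timestamp" };
  }
  if (nowMs - ts > MAX_AGE_MS) {
    return { ok: false, reason: "validation", error: "event timestamp older than 5 days" };
  }
  if (ts - nowMs > FUTURE_TOLERANCE_MS) {
    // Too far in the future: clamp to server wall-clock time.
    return { ok: true, timestampMs: nowMs };
  }
  return { ok: true, timestampMs: ts };
}
```

Passing nowMs in explicitly, instead of calling Date.now() inside, keeps the rule easy to test at the boundaries.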

Behaviour:

  1. Each event in the batch is processed as if sent individually through /track. Same validation, same per-type handlers (track, identify, increment, decrement, group, assign_group, replay). The alias type is rejected per-row with the same error single-event /track returns.

  2. Per-item validation failures don't fail the whole batch. Response is always 202 once auth + envelope pass:

    {
      "accepted": 1998,
      "rejected": [
        { "index": 12, "reason": "validation", "error": "payload.name: Too small: expected string to have >=1 characters" },
        { "index": 47, "reason": "validation", "error": "event timestamp older than 5 days" }
      ]
    }

    The caller can fix and retry only the bad indices instead of having to re-send 1998 good events.

  3. Per-event timestamp respected for session derivation. A batch covering 5 days of buffered readings produces the right cluster of historical sessions, with session_start rows back-dated to each event's actual timestamp. This is the part that makes the dashboard look correct after a backfill — without it, all 500 IoT readings collapse into one session at upload time, retention curves are meaningless, and any timeline-based analysis breaks.
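
On the SDK side, handling the partial-success response might look like the sketch below. It assumes the response shape shown in item 2 and the 'validation' | 'internal' reasons from the open questions; splitRejected and the type names are hypothetical. The assumed policy is that 'validation' rejections need fixing before a resend, while anything else can be retried as-is.

```typescript
type BatchEvent = { type: string; payload: Record<string, unknown> };
type Rejection = { index: number; reason: string; error: string };
type BatchResponse = { accepted: number; rejected: Rejection[] };

// Map rejected indices back to the events that were sent.
function splitRejected(sent: BatchEvent[], response: BatchResponse) {
  const retry: BatchEvent[] = [];
  const needsFix: { event: BatchEvent; error: string }[] = [];

  for (const { index, reason, error } of response.rejected) {
    const event = sent[index];
    if (event === undefined) continue; // index outside the batch we sent
    if (reason === "validation") {
      needsFix.push({ event, error }); // bad data: resending unchanged would fail again
    } else {
      retry.push(event); // e.g. "internal": safe to resend unchanged
    }
  }
  return { retry, needsFix };
}
```

Events not listed in rejected[] were accepted and can be dropped from the local buffer right away.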

Why now (vs. workarounds)

The argument against doing this is "users can hit /track 500 times in a loop." That's true on paper but it has three real costs that show up in production:

  • Network: 500 connection setups instead of 1. With keep-alive on the server side that's ~50ms × 500 = 25s of just TLS handshake / HTTP framing overhead, before any code runs.
  • Backpressure on the device: a 500-element queue with no batch endpoint means each event has to await its own HTTP response, or you fire-and-forget 500 requests and overwhelm the device's network stack.
  • Session attribution wrong by default: even if you handle the network, you still get the timestamp problem unless the SDK and server collaborate on deriving session_id from __timestamp. Right now the server treats wall-clock-now as the bucket key for __deviceId overrides, so a buffered event lands in whatever session is currently active for that device.

Solving all three at once with a batch endpoint that respects timestamps is much cleaner than asking SDK authors to work around them.
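
To make the timestamp point concrete, here is a rough sketch of deriving a session key from the event's own __timestamp instead of wall-clock now. It assumes the deterministic 30-minute bucket named in the acceptance criteria; the function names and the key format are hypothetical, and the real implementation may well hash or encode this differently.

```typescript
const BUCKET_MS = 30 * 60 * 1000; // 30-minute session bucket

// Start of the bucket that contains the event's own timestamp.
function sessionBucketStart(timestampMs: number): number {
  return Math.floor(timestampMs / BUCKET_MS) * BUCKET_MS;
}

// Hypothetical key format: the same device and bucket always give the same id,
// no matter when the event is uploaded.
function deriveSessionId(projectId: string, deviceId: string, timestampMs: number): string {
  return `${projectId}:${deviceId}:${sessionBucketStart(timestampMs)}`;
}
```

Because the key depends only on the project, the device, and the event time, a reading captured at 10:05 and uploaded six hours later lands in the same session as one captured at 10:25, and not in whatever session is live at upload time.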

What this issue does NOT cover

  • Idempotent retries (insert_id / messageId-based dedup). A separate concern: what should happen when a flaky network causes the same batch to be sent twice? Two reasonable designs (Mixpanel insert_id vs Segment messageId) and the trade-offs (storage overhead, lookup cost, replay semantics) deserve their own discussion. The batch endpoint as proposed here writes both copies if you send it twice — fine for reliable networks, not safe for at-least-once retry loops.
  • Compression (gzip request bodies). Should be straightforward to add via Fastify; not blocking.
  • SDK changes. API-only for now. Once the endpoint is in, SDK PRs can adopt batch on a per-platform schedule. Web/Node SDKs probably don't need it; React Native, mobile, and any custom IoT SDK would benefit immediately.

Open questions

  1. Is 5 days the right historical window, or should it be configurable per-project? Mixpanel uses 5 days. Going longer makes the deterministic session bucket more expensive to dedup (need a wider Redis lock TTL), going shorter excludes some legitimate offline-first use cases (e.g., devices that only sync weekly).
  2. Should rejected[] items get a stable enum ('validation' | 'internal' | 'rate_limited') or freeform string? Current proposal: 'validation' | 'internal' so callers can distinguish "I sent bad data" from "your server hiccupped."
  3. Should the response include the queued deviceId / sessionId per accepted item? Currently it's just a count + rejected list. Including per-item identities would be useful for SDKs that want to update their local cache, but it doubles the response size for the common all-success case.

Acceptance criteria

  • POST /track/batch accepts up to 2000 events / 10 MB and dispatches each via the same per-type pipeline as /track.
  • Per-item validation failures don't fail the batch; response is 202 with { accepted, rejected[] }.
  • Events with historical __timestamp get a session_id derived from that timestamp (deterministic 30-min bucket), not from wall-clock now.
  • session_start is emitted exactly once per (projectId, sessionId) even when multiple workers / batches see the same bucket simultaneously.
  • Historical events do not extend the live sessionEnd job or push current-session state forward.
  • Events with __timestamp older than 5 days are rejected with a clear error.
  • Existing single-event /track behaviour is unchanged — no regressions.
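
The exactly-once session_start criterion boils down to an atomic "first claim wins" check per (projectId, sessionId). The sketch below uses an in-memory Set purely as a stand-in for a shared atomic store (for example a Redis set-if-absent with a TTL, which is an assumption on my part); SessionStartClaims and sessionStartsToEmit are hypothetical names, not the code in the PR.

```typescript
// In-memory stand-in for a shared "set if absent" store.
// A real deployment needs a store shared across workers, with a TTL.
class SessionStartClaims {
  private claimed = new Set<string>();

  // Returns true only for the first caller to claim this (projectId, sessionId).
  claim(projectId: string, sessionId: string): boolean {
    const key = JSON.stringify([projectId, sessionId]);
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }
}

// Given the session ids seen in one batch, return those that still need a session_start row.
function sessionStartsToEmit(
  claims: SessionStartClaims,
  projectId: string,
  sessionIds: string[],
): string[] {
  return sessionIds.filter((id) => claims.claim(projectId, id));
}
```

Two overlapping batches that both touch the same bucket then emit session_start once: whichever claims the key first emits the row, and the other skips it.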

I have an implementation that's been running in production against a self-hosted instance with the changes verified across 24 scenarios (IoT 7-day backlog, cross-bucket boundary, multi-device household, kiosk, concurrent batches, etc.). Opening a PR alongside this issue.
