[feat-097] Production semantic search query embedding silently degraded — keyword-only fallback #778

@Kneesal

TL;DR

Production semantic search at https://cms.jesusfilm.org/api/search is silently degraded — the OpenRouter query-embedding call is failing (or returning non-overlapping results), leaving keyword search as the only contributing retrieval. The try/catch in apps/cms/src/api/search/services/search.ts:154-166 swallows the failure, so the API still returns 200s with results. The hybrid search promise is not being delivered in production.

Discovered while validating PR #777 (feat-086). Roadmap ticket: docs/roadmap/content-discovery/feat-097-investigate-prod-query-embedding.md.

Evidence

Tested 6 queries against production on 2026-04-15:

| Query | Top score | Has scene-level data? |
| --- | --- | --- |
| Easter | 0.500 | No (startSeconds: null, playbackId: null) |
| forgiveness | 0.500 | No |
| Jesus heals | 0.500 | No |
| resurrection | 0.500 | No |
| centurion at the cross | empty | N/A |
| feeling alone in suffering | empty | N/A |

The score 0.500 is exactly the normalized RRF value for the rank-1 keyword result when two lists are passed to RRF (with k = 60) and the semantic list is empty:

score = (1/(k+1)) / (lists.length / (k+1))
      = (1/61)   / (2/61)
      = 0.500

If semantic were contributing AND ranked the same items at rank-1, scores would be 1.000. They are not. The thematic-only queries that should have zero keyword matches return empty — strong evidence that semantic isn't producing usable results.
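The arithmetic above can be checked with a short sketch. It assumes the normalization shown in the formula (the sum of reciprocal-rank terms divided by `lists.length / (k + 1)`) with `k = 60`. The function name `normalizedRrfScore` is hypothetical and is not taken from `search.ts`.

```typescript
// Sketch of the normalized RRF score described above.
// Assumption: k = 60, and the raw sum is divided by the best possible sum, lists / (k + 1).
const K = 60

// ranks: 1-based rank of the item in each list, or null if that list does not contain it.
function normalizedRrfScore(ranks: Array<number | null>, k: number = K): number {
  const raw = ranks.reduce<number>(
    (sum, rank) => (rank === null ? sum : sum + 1 / (k + rank)),
    0,
  )
  const best = ranks.length / (k + 1)
  return raw / best
}

const keywordOnly = normalizedRrfScore([1, null]) // rank-1 keyword hit, semantic list empty
const bothRankOne = normalizedRrfScore([1, 1])    // rank-1 in both lists
const rankOneAndTwo = normalizedRrfScore([1, 2])  // rank-1 in one list, rank-2 in the other

// prints "0.500 1.000 0.992"
console.log(keywordOnly.toFixed(3), bothRankOne.toFixed(3), rankOneAndTwo.toFixed(3))
```

Under these assumptions, a rank-1 keyword hit with no semantic contribution lands on exactly 0.500, which matches the four non-empty production results.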

For comparison, the same code paths run against a local DB return rich semantic results: themes (new life, awe, meaning), bible verses, demographics, scene-level snippets. Production returns none of this. The code is identical — only the runtime environment differs.

Hypotheses (Ranked)

  1. OPENROUTER_API_KEY env var missing or invalid in Railway. Most likely. The try/catch logs strapi.log.warn(...) which may be filtered out of production retention. (apps/cms/src/api/search/services/search.ts:154-166 swallows the failure, apps/cms/src/lib/openrouter.ts requires the env var.)
  2. OpenRouter API outage or throttling on the production IP.
  3. Model deprecation — text-embedding-3-small renamed/removed.
  4. Network egress from Railway to OpenRouter is blocked.
  5. Semantic returning data, but for different videos that never make top-5 (unlikely given empty thematic results).

Investigation Plan

1. Confirm from logs

```bash
# Brackets are escaped so that [search] matches literally instead of acting as a character class
railway logs --service forge-cms | grep -E '(embedding failed|OPENROUTER|\[search\])'
```

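If `[search] Query embedding failed, falling back to keyword-only: ...` appears on every query, the embedding call is failing on every request. The error text after the colon should indicate which of hypotheses 1-4 is the cause; a missing or rejected key confirms hypothesis 1.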

2. Verify the env var

```bash
railway variables --service forge-cms | grep OPENROUTER
```

If missing, set from Doppler. If present, validate it works:

```bash
curl -X POST https://openrouter.ai/api/v1/embeddings \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "test", "model": "text-embedding-3-small"}'
```
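When triaging the response to the request above, the HTTP status narrows down which hypothesis applies. The mapping below is a rough sketch based on standard HTTP status semantics, not on documented OpenRouter behavior. The name `classifyEmbeddingFailure` is hypothetical, and the status returned for a renamed or removed model is an assumption.

```typescript
// Maps an embeddings response status to the hypothesis it most likely supports.
// status === null means the request never received a response (network error or timeout).
type Diagnosis =
  | "ok"
  | "bad-credentials"
  | "throttled"
  | "provider-outage"
  | "bad-request-or-model"
  | "network"
  | "unknown"

function classifyEmbeddingFailure(status: number | null): Diagnosis {
  if (status === null) return "network" // hypothesis 4: egress blocked
  if (status >= 200 && status < 300) return "ok"
  if (status === 401 || status === 403) return "bad-credentials" // hypothesis 1
  if (status === 429) return "throttled" // hypothesis 2
  if (status >= 500) return "provider-outage" // hypothesis 2
  // Assumption: a renamed or removed model would surface as 400 or 404 (hypothesis 3).
  if (status === 400 || status === 404) return "bad-request-or-model"
  return "unknown"
}

console.log(classifyEmbeddingFailure(401)) // prints "bad-credentials"
```

Adding `-i` to the `curl` command prints the status line needed as input here.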

3. Make the failure visible going forward

The current try/catch is too quiet. It should still degrade gracefully (don't break the API) but should also:

  • Log at `error` level, not `warn`, so it surfaces in default log retention
  • Increment a metric (e.g., `search_query_embedding_failures_total`) so degraded operation triggers alerts
  • Optionally surface a non-blocking signal in the API response (`degraded: true` flag, or `X-Search-Mode: keyword-only` header) so consumers can render a banner
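The three points above could be combined roughly as follows. This is a minimal sketch, not the actual code in `search.ts`. The `Logger` interface, the in-memory `embeddingFailures` counter (standing in for a real `search_query_embedding_failures_total` metric), and the `embedQueryOrDegrade` helper are all hypothetical.

```typescript
// Minimal logger shape; strapi.log is assumed to satisfy it.
interface Logger {
  error(message: string): void
}

// Hypothetical stand-in for a real metrics counter (search_query_embedding_failures_total).
let embeddingFailures = 0

interface EmbeddingOutcome {
  embedding: number[] | null // null means the caller should fall back to keyword-only
  degraded: boolean          // can be surfaced as `degraded: true` or an X-Search-Mode header
}

// Wraps the embedding call: it never throws, but every failure is logged at error level and counted.
async function embedQueryOrDegrade(
  query: string,
  embed: (q: string) => Promise<number[]>,
  log: Logger,
): Promise<EmbeddingOutcome> {
  try {
    return { embedding: await embed(query), degraded: false }
  } catch (err) {
    embeddingFailures += 1
    log.error(`[search] Query embedding failed, falling back to keyword-only: ${String(err)}`)
    return { embedding: null, degraded: true }
  }
}
```

The caller keeps returning 200 in both cases. The `degraded` flag only decides whether the response carries the extra signal.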

4. Add a synthetic health probe

Run a single test embedding at boot or periodically; report failure to monitoring. Catches regressions before users notice.

```ts
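// apps/cms/src/bootstrap/probe-openrouter.ts
import type { Core } from "@strapi/strapi"
// Assumption: embedQuery is exported from the OpenRouter helper; adjust the path if it lives elsewhere.
import { embedQuery } from "../lib/openrouter"

// Runs one test embedding and logs the outcome. Never throws, so a failed probe does not block boot.
export async function probeOpenRouter(strapi: Core.Strapi): Promise<void> {
  try {
    await embedQuery("health-check probe")
    strapi.log.info("[probe] OpenRouter embedding healthy")
  } catch (err) {
    strapi.log.error(`[probe] OpenRouter embedding FAILED: ${err}`)
  }
}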
```

Verification (Once Fixed)

```bash
# Thematic-only query should return non-empty (semantic kicks in)
curl 'https://cms.jesusfilm.org/api/search?q=feeling%20alone%20in%20suffering&locale=en' | jq '.results | length'
# Expect: > 0

# Top result should have scene-level data
curl 'https://cms.jesusfilm.org/api/search?q=Easter&locale=en' | jq '.results[0]'
# Expect: startSeconds and playbackId non-null
# Expect: snippet contains scene-level themes/bible-verses prose

# Top score should NOT be exactly 0.500
curl 'https://cms.jesusfilm.org/api/search?q=Easter&locale=en' | jq '.results[0].score'
# Expect: ~1.0 (rank-1 in both lists) or ~0.95+ (rank-1 in one, rank-2 in the other)
```
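The same checks could be automated against a parsed response. The sketch below assumes the response shape implied by the `jq` filters above (`results[]` with `score`, `startSeconds`, `playbackId`). The `checkSemanticHealth` helper and the exact field types are hypothetical.

```typescript
// Response shape assumed from the jq filters above; only the fields used here are declared.
interface SearchResult {
  score: number
  startSeconds: number | null
  playbackId: string | null
}

interface SearchResponse {
  results: SearchResult[]
}

// Returns a list of human-readable problems; an empty list means semantic search looks healthy.
function checkSemanticHealth(response: SearchResponse): string[] {
  const problems: string[] = []
  const top = response.results[0]
  if (top === undefined) {
    problems.push("no results returned")
    return problems
  }
  if (top.startSeconds === null || top.playbackId === null) {
    problems.push("top result has no scene-level data")
  }
  if (Math.abs(top.score - 0.5) < 1e-9) {
    problems.push("top score is exactly 0.500 (keyword-only signature)")
  }
  return problems
}

// Example with the degraded shape observed on 2026-04-15.
const degradedExample: SearchResponse = {
  results: [{ score: 0.5, startSeconds: null, playbackId: null }],
}
console.log(checkSemanticHealth(degradedExample).length) // prints 2
```

A check like this could run after deploys against the Easter and thematic-only queries.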

Constraints

  • Keep graceful degradation. Search must continue returning 200 with keyword results when semantic fails. Don't introduce a hard 503.
  • No model swap or provider replacement. Out of scope for this ticket.

Related

  • feat-010 — original semantic search API. Graceful degradation was intentional, but the silence is the trade-off being challenged here.
  • PR #777, feat(cms): add experiences to search results (feat-086): adds experiences to search. Same embedQuery dependency means experiences will exhibit the same degraded behavior in production until this is fixed.
  • `docs/solutions/best-practices/hybrid-semantic-search-api-strapi-v5-pgvector.md` — documents the degradation strategy.
  • Roadmap ticket: `docs/roadmap/content-discovery/feat-097-investigate-prod-query-embedding.md`
