Skip to content

Improve guardrail: structured outputs + LLM-generated user messages#46

Open
dcschreiber wants to merge 2 commits intomainfrom
feat/suicide-guardrail-improvment
Open

Improve guardrail: structured outputs + LLM-generated user messages#46
dcschreiber wants to merge 2 commits intomainfrom
feat/suicide-guardrail-improvment

Conversation

@dcschreiber
Copy link
Contributor

@dcschreiber dcschreiber commented Feb 24, 2026

We changed the Guardrail prompt to also return a message. that message is returned to the user when the Guardrail blocks.

Claude:

Summary

  • Structured outputs for guardrail classifier: Uses output_config with a JSON schema to guarantee valid {decision, reason, message} responses from the API, eliminating parsing edge cases (markdown fences, malformed JSON).
  • LLM-generated user-facing messages: The guardrail prompt now produces a message field that is sent directly to the user when blocked, replacing the hardcoded rejection text. This allows the response to be contextual and more helpful.
  • Simplified error handling: Removed strip_markdown_fences, hardcoded rejection constants (GUARDRAIL_REJECTION_MESSAGE, GUARDRAIL_UNAVAILABLE_REASON, etc.), and the conditional logic in claude_service.py that chose between generic and reason-based rejections.

Changed files

  • server/chat/V2/guardrail/guardrail_service.py — Added GUARDRAIL_OUTPUT_SCHEMA, output_config in API call, message field on GuardrailResult, simplified _parse_response
  • server/chat/V2/agent/claude_service.py — Removed rejection message formatting logic; uses guardrail_result.message directly
  • server/chat/V2/prompts/prompt_fragments.py — Removed unused guardrail constants
  • server/chat/tests/test_guardrail_service.py — Updated tests for new schema; added test_output_config_passed_to_api and test_reason_field_preserved

Test plan

  • Unit tests pass (pytest server/chat/tests/test_guardrail_service.py)
  • Verify on staging that blocked messages show contextual LLM-generated responses
  • Confirm fail-closed behavior still works (Braintrust down, API errors)

🤖 Generated with Claude Code

dcschreiber and others added 2 commits February 24, 2026 12:49
…ction

The guardrail LLM now returns {decision, reason, message} where message
is sent directly to the user and reason is kept for logging/tracing.
Removes hardcoded GUARDRAIL_REJECTION_MESSAGE and related constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces prompt-based JSON enforcement with Anthropic's output_config
JSON schema, guaranteeing valid responses with constrained decision values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coolify-sefaria-github
Copy link

coolify-sefaria-github bot commented Feb 24, 2026

The preview deployment for sefaria/ai-chatbot:client is ready. 🟢

Open Preview | Open Build Logs | Open Application Logs

Last updated at: 2026-02-24 13:45:08 CET

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants