Improve guardrail: structured outputs + LLM-generated user messages by dcschreiber · Pull Request #46 · Sefaria/ai-chatbot

dcschreiber · 2026-02-24T13:44:39Z

We changed the Guardrail prompt to also return a message. that message is returned to the user when the Guardrail blocks.

Claude:

Summary

Structured outputs for guardrail classifier: Uses output_config with a JSON schema to guarantee valid {decision, reason, message} responses from the API, eliminating parsing edge cases (markdown fences, malformed JSON).
LLM-generated user-facing messages: The guardrail prompt now produces a message field that is sent directly to the user when blocked, replacing the hardcoded rejection text. This allows the response to be contextual and more helpful.
Simplified error handling: Removed strip_markdown_fences, hardcoded rejection constants (GUARDRAIL_REJECTION_MESSAGE, GUARDRAIL_UNAVAILABLE_REASON, etc.), and the conditional logic in claude_service.py that chose between generic and reason-based rejections.

Changed files

server/chat/V2/guardrail/guardrail_service.py — Added GUARDRAIL_OUTPUT_SCHEMA, output_config in API call, message field on GuardrailResult, simplified _parse_response
server/chat/V2/agent/claude_service.py — Removed rejection message formatting logic; uses guardrail_result.message directly
server/chat/V2/prompts/prompt_fragments.py — Removed unused guardrail constants
server/chat/tests/test_guardrail_service.py — Updated tests for new schema; added test_output_config_passed_to_api and test_reason_field_preserved

Test plan

Unit tests pass (pytest server/chat/tests/test_guardrail_service.py)
Verify on staging that blocked messages show contextual LLM-generated responses
Confirm fail-closed behavior still works (Braintrust down, API errors)

🤖 Generated with Claude Code

…ction The guardrail LLM now returns {decision, reason, message} where message is sent directly to the user and reason is kept for logging/tracing. Removes hardcoded GUARDRAIL_REJECTION_MESSAGE and related constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replaces prompt-based JSON enforcement with Anthropic's output_config JSON schema, guaranteeing valid responses with constrained decision values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coolify-sefaria-github · 2026-02-24T13:44:45Z

The preview deployment for sefaria/ai-chatbot:client is ready. 🟢

Open Preview | Open Build Logs | Open Application Logs

Last updated at: 2026-02-24 13:45:08 CET

dcschreiber and others added 2 commits February 24, 2026 12:49

feat: use structured outputs (output_config) for guardrail classifier

aaa1d68

Replaces prompt-based JSON enforcement with Anthropic's output_config JSON schema, guaranteeing valid responses with constrained decision values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

saengel approved these changes Feb 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve guardrail: structured outputs + LLM-generated user messages#46

Improve guardrail: structured outputs + LLM-generated user messages#46
dcschreiber wants to merge 2 commits intomainfrom
feat/suicide-guardrail-improvment

dcschreiber commented Feb 24, 2026 •

edited

Loading

Uh oh!

coolify-sefaria-github bot commented Feb 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dcschreiber commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changed files

Test plan

Uh oh!

coolify-sefaria-github bot commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dcschreiber commented Feb 24, 2026 •

edited

Loading

coolify-sefaria-github bot commented Feb 24, 2026 •

edited

Loading