| title | TrustOps |
|---|---|
| emoji | 🛡️ |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| app_port | 7860 |
| app_file | app.py |
| pinned | false |
A programmatic content moderation reinforcement learning environment built to the OpenEnv specification. Agents are evaluated on three progressively complex moderation tasks.
pip install -r requirements.txt
# Set your HuggingFace token
export HF_TOKEN="hf_..."
# Run baseline inference (produces [START]/[STEP]/[END] logs)
python inference.py
# Or launch the Gradio observability dashboard
python app.pyContent Queue → Agent (LLM) → Grader → RewardRecord
↑ ↓
CONTENT_BANK Moderation Log
Core Modules:
| File | Purpose |
|---|---|
models.py |
Pydantic data models: Content, Observation, Action, RewardRecord, EscalationTicket |
engine.py |
Classification/grading pipeline + MyEnv (OpenEnv interface) |
grader.py |
3 task-specific grading functions with score clamping (0.01–0.99) |
tasks.py |
Task registry: easy_detection, medium_classification, hard_contextual |
inference.py |
Baseline agent — runs all 3 tasks sequentially via HF Router |
app.py |
Gradio observability dashboard |
openenv.yaml |
OpenEnv spec file |
Dockerfile |
Container for HF Spaces (2 vCPUs, 8GB RAM) |
master_blueprint.md |
Master Blueprint & Unified Architecture documentation |
| Field | Type | Description |
|---|---|---|
content_id |
string |
References the Content.id being moderated |
action_type |
enum |
One of: approve, remove, flag |
reasoning_chain |
string |
Step-by-step chain-of-thought justification |
confidence_score |
float [0.0–1.0] |
Agent self-reported confidence |
approve— Content is safe; publish it.remove— Content violates policy; take it down.flag— Content is ambiguous or requires human review.
| Field | Type | Description |
|---|---|---|
content_queue |
List[Content] |
Posts awaiting moderation |
moderation_log |
List[dict] |
History of all decisions + reward breakdowns |
step_count |
int |
Steps completed in current episode |
cumulative_reward |
float |
Running reward total |
episode_active |
bool |
Whether the episode is still running |
Reward for each step is decomposed into:
| Component | Score | Condition |
|---|---|---|
| Classification Score | +0.5 |
Agent action matches expected action |
| Action Score | +0.3 |
Exact correct action taken |
| Reasoning Score | +0.0 – +0.2 |
Quality of chain-of-thought (partial credit) |
| False Negative Penalty | –0.2 |
Approved harmful content |
| False Positive Penalty | –0.1 |
Removed safe content |
Range: [0.01, 0.99] per step (clamped — never exactly 0.0 or 1.0)
- Easy Task:
+0.2for identifying toxic keywords in reasoning - Medium Task:
+0.1for capturing sarcasm/context cues in reasoning - Hard Task:
+0.1for identifying coded language or radicalization patterns
Classify clear-cut spam and direct abusive content.
E001:"BUY CHEAP FOLLOWERS NOW!!! 💰💰💰"→ remove (SPAM)E002:"Had a great lunch today..."→ approve (SAFE)
Distinguish sarcastic frustration (allow) from targeted harassment (remove).
M002:"I'm going to DESTROY you in chess tonight 😂"→ approve (CONTEXT-DEPENDENT)M001:"You're a worthless excuse for a human being..."→ remove (ABUSIVE)
Detect subtle grooming, radicalization patterns, and coded language.
H002:"Those people have been poisoning the well..."→ flag (CODED-LANGUAGE)H003:"LEAKED: Internal documents show executives covered up..."→ flag (WHISTLEBLOWER)
| Task | Difficulty | Baseline Agent | Avg Reward/Step |
|---|---|---|---|
| Toxicity Detection | Easy | Qwen2.5-72B | ~0.80 |
| Contextual Sentiment | Medium | Qwen2.5-72B | ~0.65 |
| Multi-Turn / Coded | Hard | Qwen2.5-72B | ~0.45 |
| Overall | All 12 tasks | Qwen2.5-72B | ~0.62 |
from engine import MyEnv
from models import Action, ActionType
env = MyEnv()
obs = env.reset()
action = Action(
content_id="E001",
action_type=ActionType.REMOVE,
reasoning_chain="Clear spam: ALL-CAPS, excessive emojis, suspicious link.",
confidence_score=0.95
)
obs, reward, done, info = env.step(action)
print(reward) # e.g., 0.8docker build -t trustops-env .
docker run -e HF_TOKEN=$HF_TOKEN --cpus 2 --memory 8g trustops-env- HF token read via
os.getenv("HF_TOKEN")— never hard-coded in production. - Fixed: Removed the
flag-everythinghack inengine.py:264that allowed agents to cheat by flagging all tasks for unearned+0.5classification scores.
MIT © Anubhav741