TrustOps(ThinkSync) 🛡️

title	TrustOps
emoji	🛡️
colorFrom	blue
colorTo	indigo
sdk	docker
app_port	7860
app_file	app.py
pinned	false

TrustOps(ThinkSync) 🛡️

A programmatic content moderation reinforcement learning environment built to the OpenEnv specification. Agents are evaluated on three progressively complex moderation tasks.

🚀 Quick Start

pip install -r requirements.txt

# Set your HuggingFace token
export HF_TOKEN="hf_..."

# Run baseline inference (produces [START]/[STEP]/[END] logs)
python inference.py

# Or launch the Gradio observability dashboard
python app.py

🏗️ Architecture

Content Queue  →  Agent (LLM)  →  Grader  →  RewardRecord
     ↑                                              ↓
  CONTENT_BANK                              Moderation Log

Core Modules:

File	Purpose
`models.py`	Pydantic data models: `Content`, `Observation`, `Action`, `RewardRecord`, `EscalationTicket`
`engine.py`	Classification/grading pipeline + `MyEnv` (OpenEnv interface)
`grader.py`	3 task-specific grading functions with score clamping (0.01–0.99)
`tasks.py`	Task registry: `easy_detection`, `medium_classification`, `hard_contextual`
`inference.py`	Baseline agent — runs all 3 tasks sequentially via HF Router
`app.py`	Gradio observability dashboard
`openenv.yaml`	OpenEnv spec file
`Dockerfile`	Container for HF Spaces (2 vCPUs, 8GB RAM)
`master_blueprint.md`	Master Blueprint & Unified Architecture documentation

📐 Action Space

Field	Type	Description
`content_id`	`string`	References the `Content.id` being moderated
`action_type`	`enum`	One of: `approve`, `remove`, `flag`
`reasoning_chain`	`string`	Step-by-step chain-of-thought justification
`confidence_score`	`float [0.0–1.0]`	Agent self-reported confidence

Valid Actions

approve — Content is safe; publish it.
remove — Content violates policy; take it down.
flag — Content is ambiguous or requires human review.

👁️ Observation Space

Field	Type	Description
`content_queue`	`List[Content]`	Posts awaiting moderation
`moderation_log`	`List[dict]`	History of all decisions + reward breakdowns
`step_count`	`int`	Steps completed in current episode
`cumulative_reward`	`float`	Running reward total
`episode_active`	`bool`	Whether the episode is still running

🏆 Reward Function

Reward for each step is decomposed into:

Component	Score	Condition
Classification Score	`+0.5`	Agent action matches expected action
Action Score	`+0.3`	Exact correct action taken
Reasoning Score	`+0.0 – +0.2`	Quality of chain-of-thought (partial credit)
False Negative Penalty	`–0.2`	Approved harmful content
False Positive Penalty	`–0.1`	Removed safe content

Range: [0.01, 0.99] per step (clamped — never exactly 0.0 or 1.0)

Partial Progress Signals

Easy Task: +0.2 for identifying toxic keywords in reasoning
Medium Task: +0.1 for capturing sarcasm/context cues in reasoning
Hard Task: +0.1 for identifying coded language or radicalization patterns

📋 Tasks

Task 1 — Simple Toxicity Detection (Easy)

Classify clear-cut spam and direct abusive content.

E001: "BUY CHEAP FOLLOWERS NOW!!! 💰💰💰" → remove (SPAM)
E002: "Had a great lunch today..." → approve (SAFE)

Task 2 — Contextual Sentiment Analysis (Medium)

Distinguish sarcastic frustration (allow) from targeted harassment (remove).

M002: "I'm going to DESTROY you in chess tonight 😂" → approve (CONTEXT-DEPENDENT)
M001: "You're a worthless excuse for a human being..." → remove (ABUSIVE)

Task 3 — Multi-Turn Moderation / Coded Language (Hard)

Detect subtle grooming, radicalization patterns, and coded language.

H002: "Those people have been poisoning the well..." → flag (CODED-LANGUAGE)
H003: "LEAKED: Internal documents show executives covered up..." → flag (WHISTLEBLOWER)

📊 Baseline Scores

Task	Difficulty	Baseline Agent	Avg Reward/Step
Toxicity Detection	Easy	Qwen2.5-72B	~0.80
Contextual Sentiment	Medium	Qwen2.5-72B	~0.65
Multi-Turn / Coded	Hard	Qwen2.5-72B	~0.45
Overall	All 12 tasks	Qwen2.5-72B	~0.62

🖥️ OpenEnv Interface (`MyEnv`)

from engine import MyEnv
from models import Action, ActionType

env = MyEnv()
obs = env.reset()

action = Action(
    content_id="E001",
    action_type=ActionType.REMOVE,
    reasoning_chain="Clear spam: ALL-CAPS, excessive emojis, suspicious link.",
    confidence_score=0.95
)

obs, reward, done, info = env.step(action)
print(reward)  # e.g., 0.8

🐳 Docker

docker build -t trustops-env .
docker run -e HF_TOKEN=$HF_TOKEN --cpus 2 --memory 8g trustops-env

🔐 Security

HF token read via os.getenv("HF_TOKEN") — never hard-coded in production.
Fixed: Removed the flag-everything hack in engine.py:264 that allowed agents to cheat by flagging all tasks for unearned +0.5 classification scores.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
server		server
.app_pid		.app_pid
.dockerignore		.dockerignore
.gitignore		.gitignore
Businesscontext.md		Businesscontext.md
Dockerfile		Dockerfile
EvaluationandResearch.md		EvaluationandResearch.md
LICENSE		LICENSE
README.md		README.md
Technical_Architecture.md		Technical_Architecture.md
UI.md		UI.md
app.log		app.log
app.py		app.py
app_log.txt		app_log.txt
config.py		config.py
core_concept.md		core_concept.md
developer_summary.md		developer_summary.md
engine.py		engine.py
errors.txt		errors.txt
grader.py		grader.py
inference.py		inference.py
master_blueprint.md		master_blueprint.md
models.py		models.py
openenv.yaml		openenv.yaml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
securityandportability.md		securityandportability.md
tasks.py		tasks.py
test_system.py		test_system.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TrustOps(ThinkSync) 🛡️

🚀 Quick Start

🏗️ Architecture

📐 Action Space

Valid Actions

👁️ Observation Space

🏆 Reward Function

Partial Progress Signals

📋 Tasks

Task 1 — Simple Toxicity Detection (Easy)

Task 2 — Contextual Sentiment Analysis (Medium)

Task 3 — Multi-Turn Moderation / Coded Language (Hard)

📊 Baseline Scores

🖥️ OpenEnv Interface (`MyEnv`)

🐳 Docker

🔐 Security

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TrustOps(ThinkSync) 🛡️

🚀 Quick Start

🏗️ Architecture

📐 Action Space

Valid Actions

👁️ Observation Space

🏆 Reward Function

Partial Progress Signals

📋 Tasks

Task 1 — Simple Toxicity Detection (Easy)

Task 2 — Contextual Sentiment Analysis (Medium)

Task 3 — Multi-Turn Moderation / Coded Language (Hard)

📊 Baseline Scores

🖥️ OpenEnv Interface (MyEnv)

🐳 Docker

🔐 Security

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

🖥️ OpenEnv Interface (`MyEnv`)

Packages