annabellscha/mcp-code-execution-agent-tracing

Observing Code Execution Agents with MCP

Code execution with MCP is powerful—agents that write code to call tools directly can be 100x cheaper and handle workflows that would blow past context limits. But there's a gap many teams miss: observability doesn't come for free.

This example demonstrates the Traced Gateway Pattern—a simple instrumentation layer that gives you full visibility without sacrificing efficiency.

What This Example Adds Beyond the Anthropic Article

The Anthropic article Code execution with MCP explains how to make agents more efficient. What it doesn't cover is what happens when you lose visibility into your agent's operations.

The Anthropic article teaches:

  • Progressive disclosure (tools as files on filesystem)
  • On-demand tool loading
  • Code generation for tool calling
  • Context efficiency through local data processing

This example adds:

  • The Traced Gateway Pattern for instrumenting code execution
  • How to trace what happens inside the "black box"
  • A toggle (--no-mcp-tracing) to demonstrate the observability gap

What's Real vs. Simulated

This demo simulates the MCP pattern for educational purposes. Here's what's actually happening:

Real:

  • Filesystem exploration of servers/ directory (reads actual local files)
  • LLM code generation via Anthropic API (real Claude call)
  • Langfuse tracing (creates real observable traces)
  • The execution pattern and architecture

Simulated/Mocked:

  • MCP tool implementations - no actual MCP server connections
  • Tool data - get_sheet.py generates random mock data (100 fake rows) with time.sleep() to simulate API latency
  • No actual Google Drive or Salesforce API calls

The purpose is to demonstrate the pattern - progressive disclosure, on-demand tool loading, code generation, and local data processing - without requiring real external service credentials. The observability instrumentation is fully real and shows exactly how you would trace a production implementation.
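For illustration, a mocked tool such as servers/google_drive/get_sheet.py could look like the following sketch (the function signature, field names, and delay value are assumptions based on the description above, not the repo's exact code):

```python
import random
import time

def get_sheet(input_data: dict) -> dict:
    """Mock Google Drive tool: returns fake rows instead of calling a real API."""
    time.sleep(0.2)  # simulate API latency, as the demo does
    rows = [
        {"id": i, "value": random.randint(0, 1000)}
        for i in range(100)  # 100 fake rows, as in the demo
    ]
    return {"sheet_id": input_data.get("sheet_id"), "rows": rows}
```

Because the mock returns a plain dict, it can be swapped for a real MCP client call later without changing the gateway or the generated code.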

The Observability Problem

With traditional tool calling, every operation flows through the LLM:

LLM → Tool Call → Result → LLM → Tool Call → Result → LLM
        ↑                    ↑                    ↑
    (visible)            (visible)           (visible)

Your tracing tool sees everything. With code execution:

LLM → Generate Code → [SANDBOX EXECUTION] → Summary
                            ↑
                    (invisible black box)

You've traded visibility for efficiency. And that trade has real consequences:

  • Silent runaway costs — An agent with a bug calls Salesforce 1,000 times instead of 10. You don't notice until the API bill arrives.
  • Undetectable regressions — Tool usage patterns shift after a prompt change, but your evals only check final outputs.
  • Unanswerable questions — Finance asks why costs spiked. Your traces show "code_execution: success" with no children.
  • Impossible debugging — A user reports wrong records updated. The trace shows success but no details about what actually happened.

This isn't a minor inconvenience—it's a production risk.

The Solution: The Traced Gateway Pattern

The fix is simple: enforce a traced gateway at the execution boundary. Every MCP tool call must pass through an instrumented wrapper:

def call_mcp_tool(tool_name: str, input_data: dict) -> dict:
    """Every MCP call passes through here—with tracing."""
    with langfuse.start_as_current_observation(
        as_type="tool",
        name=f"mcp.{tool_name}",
        input=input_data,
    ) as span:
        result = TOOL_IMPLEMENTATIONS[tool_name](input_data)
        span.update(output={"success": True})
        return result

The agent's generated code calls call_mcp_tool(), which is injected into the sandbox. No direct tool access—everything flows through the Traced Gateway.
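Inside the sandbox, the LLM-generated code might look like the snippet below. The tool names follow the demo's servers/ layout; the stub gateway at the top exists only so the snippet runs stand-alone — in the demo, the traced call_mcp_tool is injected instead:

```python
# Stub gateway so this snippet is self-contained; the demo injects the
# traced call_mcp_tool into the sandbox globals instead.
def call_mcp_tool(tool_name: str, input_data: dict) -> dict:
    if tool_name == "google_drive.get_sheet":
        return {"rows": [{"id": 1, "status": "stale"}, {"id": 2, "status": "fresh"}]}
    return {"success": True}

# What LLM-generated code might look like (it varies run to run):
sheet = call_mcp_tool("google_drive.get_sheet", {"sheet_id": "q3-pipeline"})

# Filter locally -- the full row data never enters the model's context.
stale = [row for row in sheet["rows"] if row.get("status") == "stale"]

call_mcp_tool("salesforce.batch_update", {"records": stale})
print(f"Updated {len(stale)} record(s)")
```

Note that both tool calls go through call_mcp_tool, so both appear as child spans of the code execution in the trace.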

What the Traced Gateway Reveals

With the pattern implemented, your traces show the full hierarchy:

Agent: efficient-mcp-workflow
├─ Tool: explore_servers              ← Progressive disclosure
├─ Tool: read_tool_definition         ← On-demand loading (x2)
├─ Generation: generate_code          ← LLM generates code
└─ Span: code_execution               ← Sandbox execution
    ├─ Tool: mcp.google_drive.get_sheet   ← MCP tool VISIBLE!
    └─ Tool: mcp.salesforce.batch_update  ← MCP tool VISIBLE!

Without instrumentation, the code_execution span would show no children.

Quick Start

# Install dependencies
pip install langfuse anthropic python-dotenv

# Configure environment
cp .env.example .env
# Edit .env with your API keys:
#   LANGFUSE_SECRET_KEY=sk-lf-...
#   LANGFUSE_PUBLIC_KEY=pk-lf-...
#   ANTHROPIC_API_KEY=sk-ant-...

# Run with full tracing (default)
python main.py

# Run WITHOUT MCP tracing to see the black box
python main.py --no-mcp-tracing

Key Learnings

1. The Traced Gateway is Non-Negotiable

Every MCP tool must be called through the instrumented gateway. No exceptions:

# mcp_client.py
def call_mcp_tool(tool_name: str, input_data: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="tool",
        name=f"mcp.{tool_name}",
        input=input_data,
    ) as span:
        # ... execute and trace

2. Context Propagation Through Sandbox

The sandbox execution environment needs the tracing context. We inject call_mcp_tool into the exec globals:

exec_globals = {
    "__builtins__": __builtins__,
    "call_mcp_tool": call_mcp_tool,  # Injected with tracing
}
exec(generated_code, exec_globals)
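If you want the sandbox slightly tighter, one common hardening step is to pass a restricted builtins dict instead of the full __builtins__. This is a sketch, not the demo's code, and exec-based restrictions are not a real security boundary — treat it as a guard rail only:

```python
import builtins

# Allow only a small whitelist of builtins inside the sandbox. NOT a real
# security boundary (exec sandboxes are escapable), just a guard rail.
SAFE_BUILTINS = {
    name: getattr(builtins, name)
    for name in ("len", "range", "print", "min", "max", "sum",
                 "sorted", "dict", "list", "enumerate", "zip")
}

def run_sandboxed(code: str, call_mcp_tool) -> dict:
    """Execute generated code with restricted builtins and the traced gateway."""
    exec_globals = {
        "__builtins__": SAFE_BUILTINS,
        "call_mcp_tool": call_mcp_tool,  # the traced gateway stays injected
    }
    exec(code, exec_globals)
    return exec_globals
```

For production, a process-level sandbox (container, gVisor, or a dedicated code-execution service) is the appropriate boundary; the whitelist above only catches accidents.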

3. Hierarchical Spans Show Causality

Wrapping code execution in a parent span ensures MCP calls nest correctly:

with langfuse.start_as_current_observation(name="code_execution"):
    exec(code, exec_globals)  # MCP calls inside become children

4. Toggle Tracing to Prove the Gap

The --no-mcp-tracing flag demonstrates what happens without instrumentation:

# With tracing
code_execution
├─ mcp.google_drive.get_sheet
└─ mcp.salesforce.batch_update

# Without tracing
code_execution
└─ (nothing visible)
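One way to implement such a toggle (a hypothetical sketch, not necessarily the demo's exact flag handling) is to build the gateway as either a traced or a pass-through wrapper:

```python
def make_gateway(langfuse, tool_impls: dict, tracing: bool = True):
    """Return a call_mcp_tool function, traced or pass-through."""
    def traced(tool_name: str, input_data: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="tool", name=f"mcp.{tool_name}", input=input_data
        ) as span:
            result = tool_impls[tool_name](input_data)
            span.update(output={"success": True})
            return result

    def untraced(tool_name: str, input_data: dict) -> dict:
        # Identical behavior, but no span is created -- the black box.
        return tool_impls[tool_name](input_data)

    return traced if tracing else untraced
```

Because both variants have the same signature, the generated code is unchanged; only the trace output differs, which is exactly the gap the flag is meant to demonstrate.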

File Structure

mcp-tracing-example/
├── main.py              # Workflow demonstrating all 4 steps
├── mcp_client.py        # Instrumented MCP client + code execution
├── servers/             # MCP tools as explorable files
│   ├── google_drive/
│   │   ├── get_sheet.py
│   │   └── list_files.py
│   └── salesforce/
│       ├── batch_update.py
│       └── update_record.py
├── .env.example
└── README.md

Token Efficiency (for context)

The Anthropic article's efficiency claims hold:

Approach          Tokens     Why
Traditional       ~11,300    All tools + all data pass through the model
Code Execution    ~1,500     Only the needed tools + a final summary

Savings: ~87% - and with this example, you maintain full observability.

Key Takeaway

MCP code execution makes agents cheaper and more powerful. But observability doesn't come for free—you have to re-introduce it deliberately at the execution boundary.

If code executes outside the LLM, tracing must move there too.

The Traced Gateway pattern solves this. Efficiency and observability aren't a trade-off—you can have both.
