PLCAgentBenchmark

Official repository for the paper:

Benchmarking Agentic AI Architectures for Automation Engineering


Abstract

Developing automation software requires not only control-logic generation but also hardware configuration, I/O mapping, and interaction with vendor-specific engineering tools. Despite the success of large language models (LLMs) and agentic AI in conventional software engineering, their utility for end-to-end automation engineering is under-explored. This paper compares seven architectures, ranging from a manual LLM baseline to standardized tool-enabled single-agent and multi-agent systems with optional retrieval-augmented generation and inter-agent communication. The benchmark is evaluated on two industrial use cases and measures syntactic correctness, semantic correctness against predefined hardware test cases, tool calls, token consumption, and cost. The results show that explicit orchestration, and in particular access to the vendor-specific engineering tool, substantially improves functional performance over plain LLM baselines. The single-agent architecture offers the most favorable balance between correctness and resources, whereas multi-agent extensions increase coordination overhead without consistent performance gains. These findings indicate that lightweight agent designs are currently a practical choice for industrial automation engineering, while standardized tool and inter-agent protocols remain promising for modularity and extensibility.

Keywords: IEC 61131-3, Agentic AI, Benchmarking, Code Generation, Tool Integration, MCP, A2A Communication


Repository Structure

PLCAgentBenchmark/
├── A0-Client/               # Architecture 0: Manual LLM baseline (client-side)
├── A1-Client-MCP/           # Architecture 1: Baseline + MCP tool access
├── A2-Single-Agent/         # Architecture 2: Single-agent with tool integration
├── A3-Multi-Agent/          # Architecture 3: Multi-agent orchestration
├── A4-Multi-Agent-RAG/      # Architecture 4: Multi-agent + Retrieval-Augmented Generation
├── A5-Multi-Agent-A2A/      # Architecture 5: A3 with Agent-to-Agent (A2A) communication
├── A6-Multi-Agent-A2A/      # Architecture 6: A4 with Agent-to-Agent (A2A) communication
├── Error_Pattern/           # Common error patterns observed during testing and hardware test documentation
├── Results/                 # TwinCAT files, logs, prompts, and evaluation data
├── Scripts/                 # Utility and evaluation scripts
├── TwinCAT_Projekt_Leck/    # TwinCAT reference project (hand-crafted): use case 1 — leak detection system
├── TwinCAT_Projekt_RS232/   # TwinCAT reference project (hand-crafted): use case 2 — RS232 dispenser integration
└── VectorRAG/               # Vector index toolkit for TF6340 documentation (semantic search)

Each architecture folder (A0–A6) is self-contained and includes its own requirements.txt and .env.example.


Prerequisites

  • Python 3.10+
  • Windows (PowerShell assumed; paths use \)
  • A running TwinCAT 3 installation for tool-enabled architectures (A1–A6)
  • API credentials for the LLM provider used (see .env.example in each architecture folder)
  • For RAG-enabled architectures (A4, A6): a pre-built vector index (see RAG Setup)

Setup

Each architecture uses an isolated Python virtual environment. The following steps apply to any architecture folder (example: A4-Multi-Agent-RAG):

cd A4-Multi-Agent-RAG
python -m venv .venv
.\.venv\Scripts\Activate
pip install -r requirements.txt
copy .env.example .env
# Open .env and set OPENAI_API_KEY, OPENAI_MODEL, OPENAI_BASE_URL (or equivalent)
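As an illustration, a minimal .env could look like the following. The variable names are the ones mentioned above and all values are placeholders; consult the .env.example in the respective architecture folder for the authoritative keys.

OPENAI_API_KEY=sk-...your-key...
OPENAI_MODEL=<model name, e.g. gpt-4o>
OPENAI_BASE_URL=https://api.openai.com/v1   # or the equivalent endpoint of your provider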

Running Experiments

Start scripts launch all required components (orchestrator, proxies, subagents) for each architecture.

Architectures    Start command
A0 – A4          python multi_agent_start.py
A5 – A6          python start_all.py

After a test run, logs are written into the proxy subfolders within the architecture directory. Example path: orch_proxy\logs\conversations\Orch_Agent.json.
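For a quick, unprocessed look at a log before running the analyzers described below, the file can be pretty-printed with Python's built-in json.tool module (the path is the example path from above; adjust it to your run):

python -m json.tool orch_proxy\logs\conversations\Orch_Agent.json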


RAG Setup

RAG-enabled architectures (A4, A6) require a pre-built vector index before running experiments.

Vector index (semantic search over TF6340 serial communication documentation):

See VectorRAG/README_EN.md for full instructions. Quick build (run from the VectorRAG folder):

# Example: create a venv, install deps and build the index
cd VectorRAG
python -m venv .venv
.\.venv\Scripts\Activate
pip install -r requirements.txt
python .\vectorRAG_tool.py .\TF6340_kurz_DE.md --api-key YOUR_KEY

OPENAI_API_KEY and OPENAI_MODEL must be set in the environment before building the index.
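In a PowerShell session, for example, both variables can be set temporarily before the build (placeholder values):

$env:OPENAI_API_KEY = "sk-...your-key..."
$env:OPENAI_MODEL   = "<model name>"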


Proxies

All architectures use local proxy components (orch_proxy, io_proxy, plc_proxy, rag_proxy) that forward and log requests to the upstream AI toolbox. For consolidated proxy configuration and usage instructions, see Scripts/PROXIES_README.md.


Log Analysis & Result Aggregation

The Scripts/ folder contains scripts for extracting metrics from experiment logs and aggregating results across test runs.

Analyze a single run

# For OpenAI-based architectures:
python Scripts\proxy_log_analyzer.py path\to\conversation.json

# For Claude-based architectures:
python Scripts\proxy_log_analyzer_claude.py path\to\conversation.json

Each analyzed log produces three output files in the same directory as the input log:

File                       Content
<logstem>_analysis.txt     Human-readable report (tokens, cost, duration, tool calls)
VALUES_<logstem>.csv       Numeric summary for spreadsheet import
TOOLS_<logstem>.csv        Per-tool call counts
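
The CSV outputs can be inspected directly in PowerShell; for example, using the log stem from the earlier example path (Orch_Agent):

Import-Csv .\VALUES_Orch_Agent.csv | Format-Table
Import-Csv .\TOOLS_Orch_Agent.csv | Format-Table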

Batch analysis (all architectures)

python Scripts\run_all_proxy_analyses.py
# or with an explicit root directory:
python Scripts\run_all_proxy_analyses.py --root "C:\path\to\workspace"

This script walks the workspace, detects all log directories, and runs the appropriate analyzer (OpenAI or Claude variant) automatically.

Aggregate per-architecture results

python Scripts\aggregate_architecture_combined.py --root .

Expects test folders named 1_Test/, 2_Test/, etc. Produces combined_architecture_summary.csv in each architecture folder.
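As a sketch, one architecture folder prepared for aggregation might look like this (the numbered run folders presumably hold the logs and analyzer outputs of the individual runs; the summary CSV is written by the script):

A2-Single-Agent/
├── 1_Test/
├── 2_Test/
├── ...
└── combined_architecture_summary.csv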


TwinCAT POU Converter

The utility Scripts\convert_main_to_fb_serialcom.py converts a TwinCAT MAIN.TcPOU program into a reusable FUNCTION_BLOCK (FB_SerialCom). This step became necessary when the EtherCAT hardware configuration changed during the benchmark: refactoring the main program into a function block makes it portable across different I/O topologies while preserving the control logic.

python Scripts\convert_main_to_fb_serialcom.py path\to\MAIN.TcPOU

Output: MAIN.converted.TcPOU (or FB_SerialCom.TcPOU when renaming is enabled).


Recommended Workflow

  1. Set up environment — create .venv, install requirements, configure .env
  2. Build knowledge base — run VectorRAG if testing A4 or A6
  3. Run experiments — use the start script for the target architecture
  4. Analyze logs — run the single-file or batch analyzer
  5. Aggregate results — run aggregate_architecture_combined.py per architecture

Citation

If you use this benchmark or build upon this work, please cite:

@article{buehlmann2025benchmarking,
  title   = {Benchmarking Agentic {AI} Architectures for Automation Engineering},
  author  = {B{\"u}hlmann, Ilona and Madsen, Marwin and Sp{\"a}th, Christian and
             Pfetzinger, Fabian and Maurer, Frank and Barth, Mike},
  journal = {<Venue>},
  year    = {2026},
}

License

This repository is released under the MIT License.
