PLCAgentBenchmark

Official repository for the paper:

Benchmarking Agentic AI Architectures for Automation Engineering


Abstract

Developing automation software requires not only control-logic generation but also hardware configuration, I/O mapping, and interaction with vendor-specific engineering tools. Despite the success of large language models (LLMs) and agentic AI in conventional software engineering, their utility for end-to-end automation engineering is under-explored. This paper compares seven architectures, ranging from a manual LLM baseline to standardized tool-enabled single-agent and multi-agent systems with optional retrieval-augmented generation and inter-agent communication. The benchmark is evaluated on two industrial use cases and measures syntactic correctness, semantic correctness against predefined hardware test cases, tool calls, token consumption, and cost. The results show that explicit orchestration, and in particular access to the vendor-specific engineering tool, substantially improves functional performance over plain LLM baselines. The single-agent architecture offers the most favorable balance between correctness and resources, whereas multi-agent extensions increase coordination overhead without consistent performance gains. These findings indicate that lightweight agent designs are currently a practical choice for industrial automation engineering, while standardized tool and inter-agent protocols remain promising for modularity and extensibility.

Keywords: IEC 61131-3, Agentic AI, Benchmarking, Code Generation, Tool Integration, MCP, A2A Communication


Repository Structure

PLCAgentBenchmark/
├── A0-Client/               # Architecture 0: Manual LLM baseline (client-side)
├── A1-Client-MCP/           # Architecture 1: Baseline + MCP tool access
├── A2-Single-Agent/         # Architecture 2: Single-agent with tool integration
├── A3-Multi-Agent/          # Architecture 3: Multi-agent orchestration
├── A4-Multi-Agent-RAG/      # Architecture 4: Multi-agent + Retrieval-Augmented Generation
├── A5-Multi-Agent-A2A/      # Architecture 5: A3 with Agent-to-Agent (A2A) communication
├── A6-Multi-Agent-A2A/      # Architecture 6: A4 with Agent-to-Agent (A2A) communication
├── Error_Pattern/           # Common error patterns observed during testing and hardware test documentation
├── Results/                 # TwinCAT files, logs, prompts, and evaluation data
├── Scripts/                 # Utility and evaluation scripts
├── TwinCAT_Projekt_Leck/    # TwinCAT reference project (hand-crafted): use case 1 — leak detection system
├── TwinCAT_Projekt_RS232/   # TwinCAT reference project (hand-crafted): use case 2 — RS232 dispenser integration
└── VectorRAG/               # Vector index toolkit for TF6340 documentation (semantic search)

Each architecture folder (A0–A6) is self-contained and includes its own requirements.txt and .env.example.


Prerequisites

  • Python 3.10+
  • Windows (PowerShell assumed; paths use \)
  • A running TwinCAT 3 installation for tool-enabled architectures (A1–A6)
  • API credentials for the LLM provider used (see .env.example in each architecture folder)
  • For RAG-enabled architectures (A4, A6): a pre-built vector index (see RAG Setup)

Setup

Each architecture uses an isolated Python virtual environment. The following steps apply to any architecture folder (example: A4-Multi-Agent-RAG):

cd A4-Multi-Agent-RAG
python -m venv .venv
.\.venv\Scripts\Activate
pip install -r requirements.txt
copy .env.example .env
# Open .env and set OPENAI_API_KEY, OPENAI_MODEL, OPENAI_BASE_URL (or equivalent)
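As an illustration, a minimal .env could look like the following. The variable names are the ones mentioned above and all values are placeholders; consult the .env.example in the respective architecture folder for the authoritative keys.

OPENAI_API_KEY=sk-...your-key...
OPENAI_MODEL=<model name, e.g. gpt-4o>
OPENAI_BASE_URL=https://api.openai.com/v1   # or the equivalent endpoint of your provider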

Running Experiments

Start scripts launch all required components (orchestrator, proxies, subagents) for each architecture.

Architectures    Start command
A0 – A4          python multi_agent_start.py
A5 – A6          python start_all.py

After a test run, logs are written into the proxy subfolders within the architecture directory. Example path: orch_proxy\logs\conversations\Orch_Agent.json.
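For a quick, unprocessed look at a log before running the analyzers described below, the file can be pretty-printed with Python's built-in json.tool module (the path is the example path from above; adjust it to your run):

python -m json.tool orch_proxy\logs\conversations\Orch_Agent.json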


RAG Setup

RAG-enabled architectures (A4, A6) require a pre-built vector index before running experiments.

Vector index (semantic search over TF6340 serial communication documentation):

See VectorRAG/README_EN.md for full instructions. Quick build (run from the VectorRAG folder):

# Example: create a venv, install deps and build the index
cd VectorRAG
python -m venv .venv
.\.venv\Scripts\Activate
pip install -r requirements.txt
python .\vectorRAG_tool.py .\TF6340_kurz_DE.md --api-key YOUR_KEY

OPENAI_API_KEY and OPENAI_MODEL must be set in the environment before building the index.
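In a PowerShell session, for example, both variables can be set temporarily before the build (placeholder values):

$env:OPENAI_API_KEY = "sk-...your-key..."
$env:OPENAI_MODEL   = "<model name>"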


Proxies

All architectures use local proxy components (orch_proxy, io_proxy, plc_proxy, rag_proxy) that forward and log requests to the upstream AI toolbox. For consolidated proxy configuration and usage instructions, see Scripts/PROXIES_README.md.


Log Analysis & Result Aggregation

The Scripts/ folder contains scripts for extracting metrics from experiment logs and aggregating results across test runs.

Analyze a single run

# For OpenAI-based architectures:
python Scripts\proxy_log_analyzer.py path\to\conversation.json

# For Claude-based architectures:
python Scripts\proxy_log_analyzer_claude.py path\to\conversation.json

Each analyzed log produces three output files in the same directory as the input log:

File                       Content
<logstem>_analysis.txt     Human-readable report (tokens, cost, duration, tool calls)
VALUES_<logstem>.csv       Numeric summary for spreadsheet import
TOOLS_<logstem>.csv        Per-tool call counts
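
The CSV outputs can be inspected directly in PowerShell; for example, using the log stem from the earlier example path (Orch_Agent):

Import-Csv .\VALUES_Orch_Agent.csv | Format-Table
Import-Csv .\TOOLS_Orch_Agent.csv | Format-Table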

Batch analysis (all architectures)

python Scripts\run_all_proxy_analyses.py
# or with an explicit root directory:
python Scripts\run_all_proxy_analyses.py --root "C:\path\to\workspace"

This script walks the workspace, detects all log directories, and runs the appropriate analyzer (OpenAI or Claude variant) automatically.

Aggregate per-architecture results

python Scripts\aggregate_architecture_combined.py --root .

Expects test folders named 1_Test/, 2_Test/, etc. Produces combined_architecture_summary.csv in each architecture folder.
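As a sketch, one architecture folder prepared for aggregation might look like this (the numbered run folders presumably hold the logs and analyzer outputs of the individual runs; the summary CSV is written by the script):

A2-Single-Agent/
├── 1_Test/
├── 2_Test/
├── ...
└── combined_architecture_summary.csv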


TwinCAT POU Converter

The utility Scripts\convert_main_to_fb_serialcom.py converts a TwinCAT MAIN.TcPOU program into a reusable FUNCTION_BLOCK (FB_SerialCom). This step became necessary when the EtherCAT hardware configuration changed during the benchmark: refactoring the main program into a function block makes it portable across different I/O topologies while preserving the control logic.

python Scripts\convert_main_to_fb_serialcom.py path\to\MAIN.TcPOU

Output: MAIN.converted.TcPOU (or FB_SerialCom.TcPOU when renaming is enabled).


Recommended Workflow

  1. Set up environment — create .venv, install requirements, configure .env
  2. Build knowledge base — run VectorRAG if testing A4 or A6
  3. Run experiments — use the start script for the target architecture
  4. Analyze logs — run the single-file or batch analyzer
  5. Aggregate results — run aggregate_architecture_combined.py per architecture

Citation

If you use this benchmark or build upon this work, please cite:

@article{buehlmann2025benchmarking,
  title   = {Benchmarking Agentic {AI} Architectures for Automation Engineering},
  author  = {B{\"u}hlmann, Ilona and Madsen, Marwin and Sp{\"a}th, Christian and
             Pfetzinger, Fabian and Maurer, Frank and Barth, Mike},
  journal = {<Venue>},
  year    = {2026},
}

License

This repository is released under the MIT License.
