Retrieval-Augmented Reflexion: Retrieval-Aided Language Agents with Verbal Reinforcement Learning

Overview

We extend the Reflexion framework with Retrieval-Augmented Reflexion (RAR), a novel strategy that retrieves semantically similar past trajectories from a persistent episodic memory store, diversifies them via Maximum Marginal Relevance (MMR), and uses them as contrastive context when generating reflections. The agent can therefore learn from a broader set of past experiences, both failures and successes, rather than relying solely on its most recent attempt.
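The MMR diversification step mentioned above can be illustrated with a minimal, standard-library sketch. It is not the repository's implementation: the function names `cosine` and `mmr_select`, the trade-off parameter `lam`, and the toy two-dimensional embeddings are all assumptions made for illustration. Real trajectory embeddings would come from whatever sentence encoder the retrieval store uses.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query: Sequence[float],
               candidates: List[Sequence[float]],
               k: int = 3,
               lam: float = 0.5) -> List[int]:
    """Greedily pick up to k candidate indices by Maximum Marginal Relevance.

    Each step selects the candidate that maximises
        lam * sim(query, c) - (1 - lam) * max(sim(c, s) for s already selected)
    so lam=1.0 is plain top-k by relevance and smaller lam favours diversity.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: candidates 0 and 1 are near-duplicates, candidate 2 is different.
query = [1.0, 1.0]
trajectories = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr_select(query, trajectories, k=2, lam=0.5))  # [0, 2]
print(mmr_select(query, trajectories, k=2, lam=1.0))  # [0, 1]
```

With `lam=0.5` the second pick skips the near-duplicate trajectory in favour of a less similar one, which is the behaviour that makes the retrieved context contrastive rather than repetitive.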

We implement and evaluate six strategies across three tasks (HotPotQA, ALFWorld, and HumanEval Hard):

Strategy	Description
Simple	Single generation attempt, no reflection or memory (HumanEval only)
ReAct	No memory, no reflection. The agent attempts each task from scratch in every trial
CoT + GT	Chain-of-thought with ground truth context injected (Wikipedia passage or docstring)
Reflexion	Standard Reflexion with the last 3 reflections stored in memory
ExpeL	Two-phase: gather trajectories via Reflexion, extract insights, evaluate with injected insights
RAR (ours)	Retrieves top-k past trajectories via semantic similarity and error-class matching, diversified via MMR, used as contrastive reflection context
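For contrast with RAR's retrieval store, the "last 3 reflections" memory used by the standard Reflexion strategy can be sketched as a bounded buffer. The class name `ReflectionMemory` and its methods are hypothetical and only illustrate the idea; the repository's agents may store reflections differently.

```python
from collections import deque
from typing import Deque, List


class ReflectionMemory:
    """Hypothetical helper: keeps only the most recent `max_size` reflections."""

    def __init__(self, max_size: int = 3) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._items: Deque[str] = deque(maxlen=max_size)

    def add(self, reflection: str) -> None:
        self._items.append(reflection)

    def recent(self) -> List[str]:
        """Reflections from oldest to newest."""
        return list(self._items)

    def as_prompt_context(self) -> str:
        """Render the stored reflections as a bullet list for the next prompt."""
        return "\n".join(f"- {r}" for r in self._items)


memory = ReflectionMemory(max_size=3)
for trial in range(1, 5):
    memory.add(f"reflection from trial {trial}")
print(memory.as_prompt_context())
```

After four trials only the reflections from trials 2 to 4 remain, so anything learned earlier is lost unless it is retrieved from a persistent store, which is the gap RAR targets.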

Setup

Prerequisites

Python 3.9+
An OpenAI-compatible API key

Environment variable

export OPENAI_API_KEY="<your-api-key>"

HotPotQA (Reasoning)

Each experiment runs 100 randomly sampled questions from the HotPotQA distractor dataset across 5 trials.

git clone https://github.com/USD-AI-ResearchLab/reflexion.git
cd reflexion/hotpotqa_runs
pip install -r requirements.txt
cd experiments

Run scripts

python ReactQA.py          # ReAct baseline
python CoTQA.py            # CoT + Ground Truth
python ReflexionQA.py      # Standard Reflexion
python RetrievalQA.py      # RAR (ours)
python ExpelQA.py          # ExpeL baseline

Results are saved to ../root/<strategy>/.

ALFWorld (Sequential Decision-Making)

Each experiment runs 134 household environments across 10 trials.

cd alfworld_runs
pip install -r requirements.txt

Run scripts

python main.py --strategy base               --num_trials 10 --num_envs 134 --run_name react_run      --model gpt-oss
python main.py --strategy reflexion          --num_trials 10 --num_envs 134 --run_name reflexion_run  --model gpt-oss --use_memory
python main.py --strategy retrieved_trajectory_reflexion \
               --num_trials 10 --num_envs 134 --run_name rar_run --model gpt-oss --use_memory
python main.py --strategy expel --expel_n_gather 10 \
               --num_trials 11 --num_envs 134 --run_name expel_run --model gpt-oss

Results are saved to ./root/<run_name>/.

HumanEval Hard (Code Generation)

Each experiment runs 50 curated HumanEval Hard problems across 10 iterations.

cd programming_runs
pip install -r requirements.txt

Run scripts

python main.py --strategy simple     --run_name simple    --root_dir root \
               --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
               --language py --model gpt-oss --max_iters 10 --pass_at_k 1

python main.py --strategy cot_gt     --run_name cot_gt    --root_dir root \
               --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
               --language py --model gpt-oss --max_iters 10 --pass_at_k 1

python main.py --strategy reflexion  --run_name reflexion --root_dir root \
               --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
               --language py --model gpt-oss --max_iters 10 --pass_at_k 1

python main.py --strategy retrieval  --run_name retrieval --root_dir root \
               --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
               --language py --model gpt-oss --max_iters 10 --pass_at_k 1

python main.py --strategy expel      --run_name expel     --root_dir root \
               --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
               --language py --model gpt-oss --max_iters 10 --pass_at_k 1

Results are saved to root/<run_name>/.

Running on Kubernetes (NRP Nautilus)

All experiments can be submitted as Kubernetes batch jobs using the YAML job files listed under "Available job files" below.

Before submitting any job

Open the YAML file and update the following fields:

metadata:
  namespace: <your-namespace> # e.g. kc-ai-research-lab — MUST match your Nautilus namespace
...
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: <your-secret-name> # e.g. openai-secret — must exist in your namespace
        key: api-key # must match the key name inside your secret
...
volumes:
  - name: results-volume
    persistentVolumeClaim:
      claimName: <your-pvc-name> # e.g. reflexion-data-pvc — must exist in your namespace

To create the API key secret if it does not already exist:

kubectl create secret generic openai-secret \
  --from-literal=api-key=<your-api-key> \
  -n <your-namespace>

Submit a job

kubectl apply -f <job-file>.yml -n <your-namespace>

Monitor job status

kubectl get jobs -n <your-namespace>
kubectl logs job/<job-name> -n <your-namespace>

Copy results from PVC

kubectl run pvc-reader --image=busybox --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"<your-pvc-name>"}}],"containers":[{"name":"pvc-reader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"mountPath":"/data","name":"data"}]}]}}' \
  -n <your-namespace>

kubectl cp <your-namespace>/pvc-reader:/data ./results
kubectl delete pod pvc-reader -n <your-namespace>

Available job files

File	Task	Strategy
`hotpot_react_job.yml`	HotPotQA	ReAct
`hotpot_reflexion_job.yml`	HotPotQA	Reflexion
`hotpot_retrieval_job.yml`	HotPotQA	RAR
`hotpot_expel_job.yml`	HotPotQA	ExpeL
`alf_react_job.yml`	ALFWorld	ReAct
`alf_reflexion_job.yml`	ALFWorld	Reflexion
`alf_retrieval_job.yml`	ALFWorld	RAR
`alf_expel_job.yml`	ALFWorld	ExpeL
`prog_simple_job.yml`	HumanEval Hard	Simple
`prog_reflexion.yml`	HumanEval Hard	Reflexion
`prog_retrieval.yml`	HumanEval Hard	RAR
`prog_expel.yml`	HumanEval Hard	ExpeL

Metrics

All experiments report four metrics at each trial or iteration:

Metric	Description
Success Rate	Cumulative fraction of tasks solved at or before trial $t$
Fail Rate	Fraction of tasks attempted and failed at trial $t$
Halt Rate	Fraction of tasks where the agent exhausted its step budget
Avg Steps	Mean number of environment interactions per active task

Results CSVs follow the format: Trial,SuccessRate,FailRate,HaltedRate,AvgSteps.
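As a rough illustration of how the four metrics relate, the sketch below computes them from per-trial outcomes and emits rows in the CSV format above. Several details are assumptions rather than facts about the repository: the function name `trial_metrics`, the outcome labels `"success"`, `"fail"`, and `"halt"`, 0-based trial numbering, two-decimal formatting, and the choice to divide all three rates by the total number of tasks while averaging steps only over tasks still active in that trial.

```python
from typing import Dict, List, Set, Tuple

# (outcome, steps) for one task in one trial; outcome is "success", "fail", or "halt".
Attempt = Tuple[str, int]


def trial_metrics(attempts_by_trial: List[Dict[str, Attempt]],
                  num_tasks: int) -> List[str]:
    """Return CSV lines: Trial,SuccessRate,FailRate,HaltedRate,AvgSteps."""
    rows = ["Trial,SuccessRate,FailRate,HaltedRate,AvgSteps"]
    solved: Set[str] = set()
    for trial, attempts in enumerate(attempts_by_trial):
        # Tasks solved in an earlier trial are no longer active.
        active = {task: a for task, a in attempts.items() if task not in solved}
        solved |= {task for task, (outcome, _) in active.items() if outcome == "success"}
        failed = sum(1 for outcome, _ in active.values() if outcome == "fail")
        halted = sum(1 for outcome, _ in active.values() if outcome == "halt")
        avg_steps = (
            sum(steps for _, steps in active.values()) / len(active) if active else 0.0
        )
        rows.append(
            f"{trial},{len(solved) / num_tasks:.2f},{failed / num_tasks:.2f},"
            f"{halted / num_tasks:.2f},{avg_steps:.2f}"
        )
    return rows


# Four hypothetical tasks over two trials.
history = [
    {"a": ("success", 4), "b": ("fail", 6), "c": ("halt", 10), "d": ("fail", 8)},
    {"b": ("success", 5), "c": ("fail", 7), "d": ("halt", 10)},
]
print("\n".join(trial_metrics(history, num_tasks=4)))
```

Under these assumptions the cumulative success rate, the fail rate, and the halt rate sum to 1 at every trial, because each unsolved task is attempted again and ends in exactly one of the three outcomes.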
