We extend the Reflexion framework with Retrieval-Augmented Reflexion (RAR), a novel strategy that retrieves semantically similar past trajectories from a persistent episodic memory store, diversifies them via Maximum Marginal Relevance (MMR), and uses them as contrastive context when generating reflections. This allows the agent to learn from a broader set of past experiences — both failures and successes — rather than relying solely on the most recent attempt.
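The persistent episodic memory store described above can be sketched as follows. This is an illustrative minimal sketch, not the repository's implementation: the class name, method signatures, and the `error_class` field are assumptions made for exposition.

```python
import numpy as np

class EpisodicMemory:
    """Minimal persistent store of past trajectories keyed by embedding.

    Illustrative only: the real store, embedding model, and record schema
    live in the repository code.
    """
    def __init__(self):
        self.embeddings = []   # unit-normalized embedding vectors
        self.records = []      # (trajectory_text, outcome, error_class)

    def add(self, embedding, trajectory, outcome, error_class=None):
        v = np.asarray(embedding, dtype=float)
        self.embeddings.append(v / np.linalg.norm(v))
        self.records.append((trajectory, outcome, error_class))

    def top_k(self, query_embedding, k=5, error_class=None):
        """Cosine-similarity retrieval, optionally restricted to records
        whose error class matches the query's."""
        q = np.asarray(query_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        idxs = [i for i, (_, _, ec) in enumerate(self.records)
                if error_class is None or ec == error_class]
        # Dot product equals cosine similarity for unit-normalized vectors.
        idxs.sort(key=lambda i: -float(self.embeddings[i] @ q))
        return [self.records[i] for i in idxs[:k]]
```

Storing both failed and successful trajectories is what makes the retrieved set usable as contrastive context.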
We implement and evaluate five strategies across three tasks:
| Strategy | Description |
|---|---|
| Simple | Single generation attempt, no reflection or memory (HumanEval only) |
| ReAct | No memory, no reflection. Agent attempts each task from scratch every trial |
| CoT + GT | Chain-of-thought with ground truth context injected (Wikipedia passage or docstring) |
| Reflexion | Standard Reflexion with last 3 reflections stored in memory |
| ExpeL | Two-phase: gather trajectories via Reflexion, extract insights, evaluate with injected insights |
| RAR (ours) | Retrieves top-k past trajectories via semantic similarity and error-class matching, diversified via MMR, used as contrastive reflection context |
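The MMR diversification step named in the table can be sketched as a greedy selection loop. The function name, signature, and `lambda_weight` default below are hypothetical, and candidate embeddings are assumed to be unit-normalized rows of a NumPy array.

```python
import numpy as np

def mmr_select(query_vec, candidate_vecs, k=3, lambda_weight=0.7):
    """Greedy Maximum Marginal Relevance: trade off relevance to the
    query against redundancy with already-selected trajectories."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    # Cosine similarity reduces to a dot product for unit-normalized vectors.
    relevance = candidate_vecs @ query_vec
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            # Redundancy = highest similarity to anything already picked.
            redundancy = max(
                (float(candidate_vecs[i] @ candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_weight * relevance[i] - (1 - lambda_weight) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

The first pick is always the most relevant candidate; subsequent picks are penalized for similarity to prior picks, so a near-duplicate of an already-selected trajectory loses to a more distinct one.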
- Python 3.9+
- An OpenAI-compatible API key
```bash
export OPENAI_API_KEY=<your key>
```

Each experiment runs 100 randomly sampled questions from the HotPotQA distractor dataset across 5 trials.
```bash
git clone https://github.com/USD-AI-ResearchLab/reflexion.git
cd reflexion/hotpotqa_runs
pip install -r requirements.txt
cd experiments
python ReactQA.py       # ReAct baseline
python CoTQA.py         # CoT + Ground Truth
python ReflexionQA.py   # Standard Reflexion
python RetrievalQA.py   # RAR (ours)
python ExpelQA.py       # ExpeL baseline
```

Results are saved to `../root/<strategy>/`.
Each experiment runs 134 household environments across 10 trials.
```bash
cd alfworld_runs
pip install -r requirements.txt
python main.py --strategy base --num_trials 10 --num_envs 134 --run_name react_run --model gpt-oss
python main.py --strategy reflexion --num_trials 10 --num_envs 134 --run_name reflexion_run --model gpt-oss --use_memory
python main.py --strategy retrieved_trajectory_reflexion \
    --num_trials 10 --num_envs 134 --run_name rar_run --model gpt-oss --use_memory
python main.py --strategy expel --expel_n_gather 10 \
    --num_trials 11 --num_envs 134 --run_name expel_run --model gpt-oss
```

Results are saved to `./root/<run_name>/`.
Each experiment runs 50 curated HumanEval Hard problems across 10 iterations.
```bash
cd programming_runs
pip install -r requirements.txt
python main.py --strategy simple --run_name simple --root_dir root \
    --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
    --language py --model gpt-oss --max_iters 10 --pass_at_k 1
python main.py --strategy cot_gt --run_name cot_gt --root_dir root \
    --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
    --language py --model gpt-oss --max_iters 10 --pass_at_k 1
python main.py --strategy reflexion --run_name reflexion --root_dir root \
    --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
    --language py --model gpt-oss --max_iters 10 --pass_at_k 1
python main.py --strategy retrieval --run_name retrieval --root_dir root \
    --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
    --language py --model gpt-oss --max_iters 10 --pass_at_k 1
python main.py --strategy expel --run_name expel --root_dir root \
    --dataset_path ./benchmarks/humaneval-py_hardest50.jsonl \
    --language py --model gpt-oss --max_iters 10 --pass_at_k 1
```

Results are saved to `root/<run_name>/`.
All experiments can be submitted as Kubernetes batch jobs using the YAML files in `k8s/`.
Open the YAML file and update the following fields:
```yaml
metadata:
  namespace: <your-namespace>      # e.g. kc-ai-research-lab — MUST match your Nautilus namespace
...
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: <your-secret-name>   # e.g. openai-secret — must exist in your namespace
        key: api-key               # must match the key name inside your secret
...
volumes:
  - name: results-volume
    persistentVolumeClaim:
      claimName: <your-pvc-name>   # e.g. reflexion-data-pvc — must exist in your namespace
```

To create the API key secret if it does not already exist:
```bash
# Create the API key secret
kubectl create secret generic openai-secret \
  --from-literal=api-key=<your-api-key> \
  -n <your-namespace>

# Submit a job
kubectl apply -f k8s/<job-file>.yaml -n <your-namespace>

# Check job status and logs
kubectl get jobs -n <your-namespace>
kubectl logs job/<job-name> -n <your-namespace>

# Retrieve results from the PVC via a temporary pod
kubectl run pvc-reader --image=busybox --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"<your-pvc-name>"}}],"containers":[{"name":"pvc-reader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"mountPath":"/data","name":"data"}]}]}}' \
  -n <your-namespace>
kubectl cp <your-namespace>/pvc-reader:/data ./results
kubectl delete pod pvc-reader -n <your-namespace>
```

| File | Task | Strategy |
|---|---|---|
| `hotpot_react_job.yaml` | HotPotQA | ReAct |
| `hotpot_reflexion_job.yaml` | HotPotQA | Reflexion |
| `hotpot_retrieval_job.yaml` | HotPotQA | RAR |
| `hotpot_expel_job.yaml` | HotPotQA | ExpeL |
| `alfworld_react_job.yaml` | ALFWorld | ReAct |
| `alfworld_reflexion_job.yaml` | ALFWorld | Reflexion |
| `alfworld_retrieval_job.yaml` | ALFWorld | RAR |
| `alfworld_expel_job.yaml` | ALFWorld | ExpeL |
| `prog_simple_job.yaml` | HumanEval Hard | Simple |
| `prog_reflexion_job.yaml` | HumanEval Hard | Reflexion |
| `prog_retrieval_job.yaml` | HumanEval Hard | RAR |
| `prog_expel_job.yaml` | HumanEval Hard | ExpeL |
All experiments report four metrics at each trial or iteration:
| Metric | Description |
|---|---|
| Success Rate | Cumulative fraction of tasks solved at or before the given trial |
| Fail Rate | Fraction of tasks attempted and failed at the given trial |
| Halt Rate | Fraction of tasks where the agent exhausted its step budget |
| Avg Steps | Mean number of environment interactions per active task |
Results CSVs follow the format `Trial,SuccessRate,FailRate,HaltedRate,AvgSteps`.
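Given that column layout, a results CSV can be parsed with the standard library alone. The helper names below are illustrative; the only assumption taken from the source is that SuccessRate is cumulative, so the last trial's value is the overall solve rate.

```python
import csv
import io

def load_results(csv_text):
    """Parse a results CSV (Trial,SuccessRate,FailRate,HaltedRate,AvgSteps)
    into a list of per-trial metric dicts with numeric values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "Trial": int(row["Trial"]),
            "SuccessRate": float(row["SuccessRate"]),
            "FailRate": float(row["FailRate"]),
            "HaltedRate": float(row["HaltedRate"]),
            "AvgSteps": float(row["AvgSteps"]),
        })
    return rows

def final_success_rate(rows):
    """Success rate is cumulative across trials, so the last trial's
    value is the overall fraction of tasks solved."""
    return max(rows, key=lambda r: r["Trial"])["SuccessRate"]
```

For example, a two-trial CSV with SuccessRate values 0.42 and 0.55 yields a final success rate of 0.55.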
