FloydARC adapts FloydNet to the ARC-AGI benchmark and achieves state-of-the-art performance among models trained primarily on ARC-style data (rather than large-scale web corpora).
The Abstraction and Reasoning Corpus (ARC) benchmark has attracted substantial interest in recent years. It evaluates a model’s ability to infer underlying rules from only a few examples, emphasizing reasoning and generalization. While large language models trained on massive internet data can achieve strong results, models trained mainly on ARC-style data face a significantly harder challenge. Prior work such as VARC and Loop-ViT shows that treating ARC as a vision-centric task can be highly effective.
FloydNet demonstrates strong performance on neural algorithmic reasoning. In this repository, we present FloydARC, a FloydNet-based system for ARC-AGI, achieving SOTA results among ARC-trained models.
| Model | #params | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|---|
| large language models (LLMs) | | | |
| Deepseek R1 | 671B | 15.8 | 1.3 |
| Claude 3.7 8k | N/A | 21.2 | 0.9 |
| o3-mini-high | N/A | 34.5 | 3.0 |
| GPT-5 | N/A | 44.0 | 1.9 |
| Grok-4-thinking | 1.7T | 66.7 | 16.0 |
| Bespoke (Grok-4) | 1.7T | 79.6 | 29.4 |
| recurrent models | | | |
| HRM | 27M | 40.3 | 5.0 |
| TRM | 7M | 44.6 | 7.8 |
| vision models | | | |
| VARC | 73M | 60.4 | 11.1 |
| Loop-ViT | 11.2M | 61.2 | 10.3 |
| floydnet model | | | |
| FloydARC (ours) | 153.7M | 70.5 | 15.3 |
For baselines (non-FloydARC), the reported ARC-AGI-1/2 numbers are taken from the public results summarized in VARC and Loop-ViT (see links above).
FloydARC architecture. Inputs are the query canvas and a noised answer canvas; patch tokens are generated from linear patch embedding. Following FloydNet, supernodes augment tokens into a pairwise relative representation, which is refined by K looped pivotal-attention blocks and a prediction head to produce the predicted answer canvas.
We train on ARC-GEN and ARC-CDG:
- ARC-GEN: tasks aligned with the original ARC-AGI-1 training set (https://github.com/google/ARC-GEN)
- ARC-CDG: a collection of more primitive, compositional operation tasks (https://github.com/Poolminer/ARC-CDG)
We will release our preprocessed training data on the Hugging Face Hub.
To improve generalization, we apply a set of on-the-fly augmentations, including rotations, flips, and color transforms. See:
- `far/augmenter.py`
- `far/augment_op.py`
Augmentations are applied during both training and inference.
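As a rough illustration (function and parameter names here are ours for exposition, not the actual API in `far/augmenter.py`), an ARC-style augmentation can combine a rotation, an optional flip, and a color permutation:

```python
import numpy as np

def augment(canvas: np.ndarray, k_rot: int, flip: bool, perm: np.ndarray) -> np.ndarray:
    """Apply a 90-degree rotation, an optional horizontal flip, and a color remap.

    canvas: (H, W) grid of ARC color indices in 0..9
    perm:   a permutation of the 10 color indices
    """
    out = np.rot90(canvas, k=k_rot)   # rotate by k_rot * 90 degrees
    if flip:
        out = np.fliplr(out)          # mirror left-right
    return perm[out]                  # remap each color index

rng = np.random.default_rng(0)
canvas = np.array([[0, 1], [2, 3]])
perm = rng.permutation(10)            # random bijection over the 10 ARC colors
aug = augment(canvas, k_rot=1, flip=False, perm=perm)
```

Note that the same transform must be applied consistently to the query and answer canvases; at inference, predictions would be mapped back through the inverse transform before voting.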
ARC requires pixel-level precision. Instead of convolution-based patchification commonly used in vision models, we use linear patchify: flatten the canvas and map it to patch embeddings with a linear layer.
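A minimal sketch of linear patchification, assuming a one-hot color encoding and illustrative shapes (in the model the projection `W` is a learned layer):

```python
import numpy as np

def linear_patchify(canvas: np.ndarray, patch: int, W: np.ndarray) -> np.ndarray:
    """canvas: (H, W) integer color grid; returns (num_patches, d_model) embeddings."""
    H, Wd = canvas.shape
    onehot = np.eye(10)[canvas]                       # (H, W, 10) one-hot colors
    # split into non-overlapping patch x patch tiles, then flatten each tile
    tiles = onehot.reshape(H // patch, patch, Wd // patch, patch, 10)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 10)
    return tiles @ W                                  # single linear projection

rng = np.random.default_rng(0)
canvas = rng.integers(0, 10, size=(8, 8))
W = rng.standard_normal((2 * 2 * 10, 64))             # patch=2, d_model=64 (illustrative)
tokens = linear_patchify(canvas, patch=2, W=W)        # (16, 64) patch tokens
```

Because each pixel enters the projection exactly once and unmixed, pixel identity is preserved at the token boundary, unlike overlapping or strided convolutions.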
Following FloydNet, we initialize a pairwise relationship representation from pixel-level canvas features. We also inject metadata (e.g., task id, augmentation type) as supernodes into the representation; supernodes attend to all patch tokens and vice versa.
We adopt a looped computation scheme: the same pivotal-attention blocks are applied repeatedly to refine the hidden representation.
h = input_embedding(canvas)
for step in range(num_loops):
    for b in blocks:
        h = b(h)
output = output_head(h)

We incorporate a diffusion-style process (DDPM) to improve generalization and to handle multi-solution cases:
- Training: input includes the query canvas and a noised answer canvas.
- Inference: initialize the answer canvas with Gaussian noise, then iteratively denoise to produce the final output.
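Schematically, the inference loop might look like the following, where `model` stands in for FloydARC predicting the clean answer canvas; the step count and linear schedule are illustrative assumptions, not the repo's actual configuration:

```python
import numpy as np

def ddpm_infer(model, query, shape, steps=10, rng=None):
    """Toy DDPM-style sampler: start from noise, iteratively move toward the
    model's predicted clean canvas."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.standard_normal(shape)            # initialize with Gaussian noise
    x0_hat = None
    for t in reversed(range(steps)):
        x0_hat = model(query, x, t)           # predict the clean answer canvas
        alpha = t / steps                     # toy linear schedule
        x = alpha * x + (1 - alpha) * x0_hat  # blend current state toward prediction
    return x0_hat

# Dummy model that always predicts a fixed target canvas
target = np.ones((4, 4))
out = ddpm_infer(lambda q, x, t: target, query=None, shape=(4, 4))
```

Because inference starts from fresh noise each time, repeated sampling can surface different valid answers for multi-solution tasks.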
At inference time, we perform lightweight test-time training (TTT) on each test task's demo puzzles. During TTT, we periodically predict the test puzzle and finally apply max-voting over the intermediate predictions.
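The max-voting step can be sketched as follows (helper names are illustrative, not the repo's actual API): serialize each predicted canvas to a hashable key and keep the most frequent one.

```python
from collections import Counter

def max_vote(predictions):
    """predictions: list of canvases as tuples of row-tuples; returns the mode."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Three intermediate checkpoints agree, two disagree; the majority wins.
preds = [((1, 2), (3, 4))] * 3 + [((0, 0), (0, 0))] * 2
best = max_vote(preds)
```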
We provide two TTT modes:
- Full-model TTT: finetune all parameters.
- LoRA TTT: finetune only low-rank adapters (LoRA), typically improving generalization while preserving pretrained knowledge. Empirically, LoRA TTT performs better than full-model TTT on ARC-AGI-1.
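A LoRA adapter over a frozen linear layer can be sketched as below; the rank, scaling, and initialization are illustrative assumptions, not the repo's configuration. Only `A` and `B` would receive gradients during TTT.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 augmented with a trainable low-rank update B @ A."""

    def __init__(self, W0: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W0.shape
        self.W0 = W0                                     # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))                 # zero-init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.W0 + self.scale * (self.B @ self.A)).T

W0 = np.eye(4)
layer = LoRALinear(W0, rank=2)
x = np.ones((1, 4))
y = layer(x)   # equals x @ W0.T at init, since B is zero
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained model exactly, which is one reason LoRA TTT tends to preserve pretrained knowledge.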
| TTT mode | ARC-AGI-1 Pass@1 | ARC-AGI-1 Pass@2 | ARC-AGI-1 Oracle | ARC-AGI-2 Pass@1 | ARC-AGI-2 Pass@2 | ARC-AGI-2 Oracle |
|---|---|---|---|---|---|---|
| Full-model | 60.4 | 65.5 | 83.5 | 6.9 | 8.6 | 22.2 |
| LoRA | 65.5 | 69.4 | 84.8 | 13.6 | 14.7 | 26.7 |
| Ensemble | 65.9 | 70.5 | 87.5 | 12.4 | 15.3 | 30.8 |
This project targets Python >= 3.12.
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# From repo root:
uv venv --python 3.12
source .venv/bin/activate
# Install dependencies
uv sync

From the repo root, the expected layout looks like:
FloydARC/ # repo root
├── rawdata/ # place original ARC(-AGI) json files here (not tracked)
│ ├── ARC-AGI-1_evaluation/
│ │ └── **/*.json
│ ├── ARC-AGI-2_evaluation/
│ │ └── **/*.json
│ └── (optional) train_data/
│ └── **/*.json
├── preprocessed/ # generated by scripts.process_data
│ ├── arc1/
│ │ └── test/ # ARC-AGI-1 eval split outputs
│ ├── arc2/
│ │ └── test/ # ARC-AGI-2 eval split outputs
│ └── arc-train/
│ └── train/ # training split outputs
├── output/ # generated by scripts.TTT / scripts.analyze
│ ├── TTT_results_ARC1/ # full-model TTT outputs (example)
│ ├── TTT_results_LoRA_ARC1/ # LoRA TTT outputs (example)
│ └── *.html # visualization reports (example)
└── (anywhere) checkpoints/
└── floydarc_ckpt/ # downloaded checkpoint folder (pass via --ckpt_path)
Notes:
- `rawdata/` should contain the original dataset JSON files (the script scans `**/*.json` recursively).
- `preprocessed/` is fully generated by `python -m scripts.process_data ...` and can be safely deleted/regenerated.
- `output/` contains per-run predictions and the HTML visualization created by `scripts.analyze`.
Hugging Face hub: https://huggingface.co/ocxlabs/FloydARC
# Build ARC-AGI-1 evaluation data
python -m scripts.process_data \
--input_dir ./rawdata/ARC-AGI-1_evaluation/ \
--output_dir ./preprocessed/arc1 \
--split test
# Build ARC-AGI-2 evaluation data
python -m scripts.process_data \
--input_dir ./rawdata/ARC-AGI-2_evaluation/ \
--output_dir ./preprocessed/arc2 \
--split test

python -m scripts.TTT \
--ckpt_path /path/to/downloaded/ckpt \
--subset arc1 \
--output_dir ./output/TTT_results

By default, the TTT script uses 8 GPUs on the current node. To use multiple nodes, write worker IPs to scripts/ip_list.txt before launching.
To evaluate ARC-AGI-2, set --subset arc2.
We provide a script to ensemble outputs (max-voting) and generate an HTML visualization.
# Analyze LoRA-TTT results
python -m scripts.analyze \
--result-folder ./output/TTT_results_LoRA_ARC1 \
--subset arc1 \
--out-html output/arc1_lora.html
# Analyze full-model TTT results
python -m scripts.analyze \
--result-folder ./output/TTT_results_ARC1 \
--subset arc1 \
--out-html output/arc1_full.html
# Ensemble both (max voting across folders)
python -m scripts.analyze \
--result-folder ./output/TTT_results_ARC1 ./output/TTT_results_LoRA_ARC1 \
--subset arc1 \
--out-html output/arc1_ensemble.html

To evaluate ARC-AGI-2, set --subset arc2.
process_data recursively scans JSON files under --input_dir and writes preprocessed outputs to --output_dir.
python -m scripts.process_data \
--input_dir /path/to/train_data \
--output_dir ./preprocessed/arc-train \
--split train

To reproduce our training recipe, we recommend large-scale distributed training (e.g., 8 nodes / 64 GPUs).
./.venv/bin/torchrun \
--master_addr $master_addr \
--master_port $master_port \
--nproc_per_node 8 \
--nnodes $world_size \
--node_rank $node_rank \
-m scripts.run \
--dataset arc-train \
--wandb_log true \
--run_name FloydARC1 \
--compile true

If you find this repository useful, please cite:
@misc{floydarc2026,
title = {FloydARC: FloydNet for ARC-AGI},
  author = {Jingcheng Yu and Xi Chen and Mingliang Zeng and Qiwei Ye},
year = {2026},
url = {https://github.com/ocx-lab/Floyd-ARC}
}