The first cross-corpus structural classification of open mathematical problems at scale.
A unified dataset of 4,957 open mathematical problems from 5 sources, each classified with structural invariants (morphism vectors) and embedded in a shared vector space. Includes:
- An Erdős-like classifier that detects structural similarity to Erdős problems across the entire corpus
- A morphism graph connecting problems that share proof techniques
- Lean 4 formalizations of 21+ theorems across 4 Erdős problems
- Tools for mathematical problem discovery — find tractable problems nobody else is working on
# Clone
git clone https://github.com/MendozaLab/math-morphism-atlas.git
cd math-morphism-atlas
# Install dependencies
pip install -r requirements.txt
# Run the classifier on the full corpus
python scripts/embed_and_classify.py --all
# Predict attack strategy for a specific Erdős problem
python scripts/erdos_morphism_engine.py --predict 500
# Find the top 50 "Erdős-like" non-Erdős problems
python scripts/embed_and_classify.py --top-erdos-like 50| Source | Problems | Format | Description |
|---|---|---|---|
| Erdős Problems | 1,179 | Structured DB | Complete catalog with prizes, domains, formal statements |
| DeepMind formal-conjectures | 2,026 | Lean 4 | Formalized conjectures with AMS classifications |
| MathOverflow | 593 | Q&A | Community-curated open problems, voted by quality |
| OEIS | 437 | Sequence data | Integer sequences with attached conjectures |
| Open Problem Garden | 722 | Wiki | Cross-discipline open problems |
| Total | 4,957 |
Each problem in training_data/unified_corpus.jsonl:
{
"id": "erdos-233",
"source": "erdos",
"label": 1,
"text": "Problem statement...",
"domain": "number theory",
"tags": "product sets, prime factors"
}For each problem, a structural fingerprint:
| Dim | Feature | Values |
|---|---|---|
| 0 | Problem type | extremal=0, density=1, growth=2, coloring=3, construction=4 |
| 1 | Object type | set=0, graph=1, hypergraph=2, sequence=3, coloring=4 |
| 2 | Constraint type | cardinality=0, sum=1, product=2, containment=3, density=4 |
| 3 | Parameter count | 1-5 |
| 4-5 | Upper/lower exponents | Float (growth rate) |
| 6 | Gap ratio | upper/lower bound ratio |
| 7 | Log-factor count | Integer |
| 8-11 | Proof methods | Binary: probabilistic, algebraic, analytic, flag algebra |
Trained on 1,179 Erdős problems (positive) vs. 3,778 non-Erdős problems (negative).
A problem is "Erdős-like" if it shares structural DNA with problems Erdős posed — extremal questions about combinatorial objects with sharp thresholds and asymptotic growth rates.
Classify problems, embed into ChromaDB, discover morphism edges.
python scripts/erdos_morphism_engine.py --classify # Classify all problems
python scripts/erdos_morphism_engine.py --embed # Embed into ChromaDB
python scripts/erdos_morphism_engine.py --morphisms # Find technique-sharing edges
python scripts/erdos_morphism_engine.py --predict 500 # Attack prediction for problem #500Train and run the Erdős-like classifier.
python scripts/embed_and_classify.py --train # Train XGBoost classifier
python scripts/embed_and_classify.py --top-erdos-like 50 # Rank non-Erdős problems
python scripts/embed_and_classify.py --evaluate # Embedding quality testsMachine-verified proofs (0 sorry) in Lean/:
| File | Problem | Theorems |
|---|---|---|
Erdos1_SubsetSums_MDL_Solved.lean |
#1 Distinct subset sums | 3 |
Sunflower_MDL_Solved.lean |
#20 Sunflower lemma | 3 |
SidonCoding_v2_Solved.lean |
#30 Sidon sets | 2 |
Erdos233.lean |
Product set bounds | 2 |
DDC_Aristotle_Solved_v3.lean |
Divisor counting chain | 2 |
Verified against Lean 4.24.0 / Mathlib f897ebcf via Aristotle (Harmonic AI).
The classification uses the HS(p) Stability Sextet — a domain-independent fingerprint originally developed for physical systems (materials science, climate, medical diagnostics), here extended to combinatorial mathematics as the 10th domain spoke.
| Sextet Element | Combinatorial Meaning |
|---|---|
| I_det | Growth rate exponent |
| θ_holo | Extremal density / threshold |
| κ_curv | Log-factor complexity |
| R_kin | Upper/lower bound gap ratio |
| Q_takens | Formalization depth (0=none → 3=Lean verified) |
| S_vn | Structural entropy of extremal configuration |
If you use this dataset or tools, please cite:
@misc{mendoza2026morphism,
author = {Mendoza, Kenneth A.},
title = {Mathematical Problem Morphism Atlas: Structural Classification of
Open Mathematical Problems},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.XXXXXXX},
url = {https://github.com/MendozaLab/math-morphism-atlas},
note = {ORCID: 0009-0000-9475-5938}
}ORCID: 0009-0000-9475-5938
- Dataset: CC BY 4.0 — use freely, must cite
- Code: Apache 2.0
- Lean proofs: Apache 2.0
Individual source data retains original licensing:
- Erdős Problems: courtesy of erdosproblems.com
- DeepMind formal-conjectures: Apache 2.0
- MathOverflow: CC BY-SA 4.0
- OEIS: OEIS End-User License Agreement
- Open Problem Garden: CC BY-SA 3.0
This dataset was constructed with assistance from AI tools (Claude, Perplexity). All structural classifications have been reviewed by the human author. The Lean 4 proofs were generated with Aristotle (Harmonic AI) and verified by machine compilation. In accordance with the erdosproblems.com AI usage policy, AI assistance is disclosed and all mathematical content has been independently verified.
- erdosproblems.com maintained by Thomas Bloom
- DeepMind formal-conjectures
- Aristotle by Harmonic AI
- The Lean and Mathlib communities
- Open Problem Garden
Kenneth A. Mendoza Independent Researcher, Waldport, Oregon, USA ken@kenmendoza.com | ORCID 0009-0000-9475-5938