A Cognitive-Architecture Approach to Fine-Tuning Small Language Models
Using First-Principles Reasoning, Custom Blueprint Generation, and LoRA Training
(Model: Mistral-7B-Instruct, Local Training on Apple Silicon MPS)
Reasoning-Enhanced SLM 2.0 is an applied research project exploring how to modify the reasoning behavior of an open-weight LLM using:
- A structured reasoning architecture
- First-principles decomposition
- Blueprint-driven dataset generation
- Parameter-efficient LoRA fine-tuning
- Local training on Apple Silicon (MPS, fp16)
Unlike most fine-tuning projects that simply teach a model new content, this project teaches the model how to think in a specific domain style: terse, accurate, technical, DevOps-oriented reasoning.
This project introduces a Reasoning Blueprint System — a method for converting domain knowledge into structured reasoning patterns that LLMs can learn reliably.
The system is visually represented in three diagram groups.
The first diagram shows how a complex domain (physics) can be reduced into:
- Final principles
- Explanatory principles
- Core logic principles
- Compressed reasoning blocks
This provides a universal method for turning high-level knowledge into reusable reasoning logic for machine learning datasets.
The second diagram shows a hierarchical decomposition of physics into conceptual layers:
- Classical Physics → Evolution 1
- Modern Physics → Evolution 2
- Theoretical Physics → Evolution 3
- Applied & Interdisciplinary Physics → Evolution 4
This layering turns a complex domain into a structured curriculum, supporting curriculum-style dataset construction and reasoning-architecture design.
The third diagram illustrates the full data pipeline used in this project:
- Create a domain-specific reasoning framework
- Feed the architecture to an LLM
- The LLM converts it into a structured blueprint
- The blueprint undergoes pattern expansion
- The output becomes a supervised fine-tuning dataset
This system allows scalable production of structured reasoning data.
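The blueprint-to-dataset expansion step can be sketched in a few lines of Python. The blueprint schema, system prompt, and sample pair below are illustrative assumptions, not the project's actual data:

```python
import json

# Hypothetical blueprint structure: domain -> list of (question, compressed answer).
# Field names and the system prompt are assumptions for illustration only.
SYSTEM_PROMPT = "You are a terse, accurate DevOps reasoning assistant."

blueprint = {
    "networking": [
        ("Why split a /16 VPC into /24 subnets?",
         "Smaller broadcast domains, cleaner workload segmentation, simpler routing."),
    ],
}

def expand_blueprint(blueprint: dict) -> list[dict]:
    """Expand (question, answer) pairs into OpenAI-style SFT records."""
    records = []
    for domain, pairs in blueprint.items():
        for question, answer in pairs:
            records.append({
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            })
    return records

def write_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON record per line, the format SFTTrainer consumes."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Pattern expansion (paraphrases, difficulty variants) would slot in between `expand_blueprint` and `write_jsonl`.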
Training steps:
- Load Mistral-7B-Instruct on CPU in fp16
- Move the model to MPS manually (bfloat16 is not supported on the MPS backend, so fp16 is used)
- Load tokenizer (SentencePiece-based, non-fast)
- Inject LoRA adapters using PEFT
- Train using TRL's SFTTrainer
- Save LoRA adapter weights
Training runs fully on Apple Silicon (M4/M3/M2) using MPS acceleration.
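The steps above roughly correspond to the following sketch. Script structure, argument names, and defaults are assumptions; the hyperparameters mirror the configuration listed in this README. Requires torch, datasets, transformers, peft, and trl:

```python
# Sketch of the training script, under the assumptions stated above.
LORA_HYPERPARAMS = {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05}

def train(dataset_path: str = "reasoning_dataset.jsonl",
          output_dir: str = "mistral-7b-devops-lora"):
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model
    from trl import SFTTrainer

    model_id = "mistralai/Mistral-7B-Instruct-v0.3"

    # 1. Load on CPU in fp16, then move to MPS manually
    #    (bf16 is not supported on the MPS backend).
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model.to("mps")

    # 2. SentencePiece-based (non-fast) tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

    # 3. Inject LoRA adapters via PEFT.
    lora = LoraConfig(task_type="CAUSAL_LM", **LORA_HYPERPARAMS)
    model = get_peft_model(model, lora)

    # 4. Supervised fine-tuning with TRL, then save adapter weights only.
    dataset = load_dataset("json", data_files=dataset_path, split="train")
    trainer = SFTTrainer(model=model, train_dataset=dataset)
    trainer.train()
    model.save_pretrained(output_dir)
```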
The dataset is stored in reasoning_dataset.jsonl, using an OpenAI-style message format:
{ "messages": [ {"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."} ] }
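A record in this format can be checked with the standard library alone; the strict system/user/assistant role order below is an assumption based on the example above:

```python
import json

# Validate one line of reasoning_dataset.jsonl against the message format above.
# The exact role order enforced here is an assumption, not a project guarantee.
EXPECTED_ROLES = ("system", "user", "assistant")

def validate_record(line: str) -> bool:
    record = json.loads(line)
    messages = record.get("messages", [])
    roles = tuple(m.get("role") for m in messages)
    # Roles must appear in order, and every message needs string content.
    return roles == EXPECTED_ROLES and all(
        isinstance(m.get("content"), str) for m in messages
    )
```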
- Model: Mistral-7B-Instruct-v0.3
- Adapter: LoRA (r=16, alpha=32, dropout=0.05)
- Device: Apple MPS (fp16)
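For intuition about what this configuration costs: a LoRA adapter on a d_out × d_in linear layer adds r·(d_in + d_out) trainable parameters, and its update is scaled by alpha / r.

```python
# Back-of-the-envelope LoRA accounting for the configuration above.
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(alpha: int = 32, r: int = 16) -> float:
    """Scaling factor applied to the low-rank update (alpha / r)."""
    return alpha / r

# Mistral-7B's 4096 x 4096 attention projections:
# lora_params(4096, 4096)  -> 131072 extra parameters per adapted layer
# lora_scaling()           -> 2.0
```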
Output directory:
mistral-7b-devops-lora/
│
├── adapter_config.json
├── adapter_model.bin
└── tokenizer/
Adapters can be merged or applied at inference time using PEFT.
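Applying or merging the adapter might look like the following sketch (the path mirrors the output directory above; the helper name is illustrative):

```python
# Sketch of inference-time adapter loading with PEFT.
# Requires transformers + peft; the helper name is an illustration.
def load_merged(adapter_dir: str = "mistral-7b-devops-lora"):
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
    model = PeftModel.from_pretrained(base, adapter_dir)  # apply adapter on top
    return model.merge_and_unload()  # fold LoRA weights into the base model
```

Keeping the adapter unmerged instead allows hot-swapping different domain adapters over one base model.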
Prompt: Explain why a /16 VPC is usually split into multiple /24 subnets.
Model Output (LoRA-tuned): To reduce broadcast domain size, minimize ARP noise, and segment workloads cleanly. Smaller /24 blocks isolate services (public, private, DB, mgmt), simplify routing, and support scalable multi-AZ layouts.
This demonstrates the intended terse, precise DevOps reasoning style.
Most fine-tuning projects adjust content.
This project adjusts cognition.
It demonstrates:
- First-principles reasoning compression
- System-level decomposition of knowledge
- Automated blueprint-to-dataset generation
- Local LoRA-based behavioral alignment
- Architecture-level thinking for ML systems
This aligns more with applied alignment research than standard ML fine-tuning.
Planned next steps:
- Add DPO for preference-based refinement
- Add reward modeling for multi-step reasoning
- Expand blueprint generator into an automated system
- Evaluate reasoning drift and alignment stability
- Fine-tune additional domains (physics, CS, finance, DevOps)
Evaluation observations:
The model demonstrates concise, principle-driven explanation patterns learned from the dataset, avoiding the verbosity typical of base models.
The model applies first-principles decomposition without explicit chain-of-thought prompting, showing that the structure of reasoning has been internalized.
The model demonstrates the "compressed logic" architecture taught during training, breaking complex social science topics into atomic constraints and goals.
Reasoning-Enhanced SLM 2.0 includes a Retrieval-Augmented Generation (RAG) layer to provide grounded, updatable domain knowledge alongside the fine-tuned reasoning behavior.
Fine-tuning shapes how the model reasons.
RAG supplies what the model knows.
This separation allows the model to behave like a “well-read professor” without embedding factual knowledge into weights.
The RAG system operates in two phases.
Ingestion phase:
- Download research papers (arXiv)
- Store documents locally
- Convert PDFs to text
- Chunk text into semantically coherent passages
- Generate embeddings using OpenAI (text-embedding-3-small)
- Store vectors and metadata in Qdrant
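The chunking step can be sketched with a simple character-based splitter; the size and overlap values below are illustrative assumptions, not the project's actual settings:

```python
# Overlapping character-window chunker for the ingestion step.
# size/overlap defaults are illustrative, not the project's real values.
def chunk_text(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # slide window, keeping `overlap` chars of context
    return chunks
```

The overlap preserves sentences that straddle a chunk boundary, so a retrieved chunk rarely starts mid-thought.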
Query phase:
- Embed the user question (same embedding model)
- Perform semantic search over Qdrant
- Retrieve top-k relevant chunks
- Assemble prompt with retrieved context
- Generate answer using Mistral-7B-Instruct (LoRA-fine-tuned, local runtime)
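The retrieval step amounts to cosine-similarity top-k search. In the project this is handled by Qdrant, but the logic can be sketched in pure Python:

```python
import math

# Pure-Python sketch of dense retrieval: cosine similarity over stored
# vectors, then top-k selection. Qdrant performs this at scale in the project.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)  # assumes non-zero vectors

def top_k(query: list[float], store: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, vector) pairs most similar to the query."""
    scored = sorted(store.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return scored[:k]
```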
- Embeddings: text-embedding-3-small (OpenAI), converting text into 1536-dimensional semantic vectors.
- Memory store: Qdrant (local vector database), persisting embeddings and payload metadata (file path, chunk index, text).
- Retrieval: dense semantic search using cosine similarity over Qdrant vectors.
- Generation: Mistral-7B-Instruct with LoRA adapters, applying domain-specific reasoning to retrieved context.
The current RAG knowledge base is bootstrapped with three arXiv research papers, stored locally as text files.
rag/data/docs/
├── 2512.11254v2.txt
├── 2512.17898v1.txt
└── 2512.17901v1.txt
Each document is split into overlapping chunks.
Each chunk is embedded and stored as an independent retrievable memory unit.
RAG addresses limitations of fine-tuning alone:
- Updatability: add or remove papers without retraining
- Grounding: answers are supported by retrieved text
- Scalability: knowledge grows via storage, not parameter updates
The system combines:
- Behavioral alignment (LoRA fine-tuning)
- Externalized knowledge memory (RAG)
Current status:
- Qdrant running locally
- Research papers ingested and embedded
- Query-time retrieval integrated with generation pipeline
Planned extensions:
- Larger document corpus
- Section-aware chunking
- Hybrid retrieval (BM25 + vectors)
- Reranking for improved precision
MIT License