Ever wondered what goes on inside those powerful machine learning libraries like Scikit-Learn, PyTorch, or TensorFlow? How does a neural network actually learn? How is gradient descent implemented? How do different data-handling tools work?
SmolML is a fully functional (though simplified) machine learning library built using only pure Python and a few standard-library modules (collections, random, and math). No NumPy, no SciPy, no C++ extensions. Just Python, all the way down.
The goal is to provide a transparent, understandable, and educational implementation of core machine learning concepts.
You can read the guides to the different sections of SmolML in any order, though the list below follows the recommended order for learning.
- SmolML - Core: Automatic Differentiation & N-Dimensional Arrays
- SmolML - Regression: Predicting Continuous Values
- SmolML - Neural Networks: Backpropagation to the limit
- SmolML - Tree Models: Decisions, Decisions!
- SmolML - Support Vector Machines: Finding the Optimal Boundary!
- SmolML - K-Means: Finding Groups in Your Data!
- SmolML - Preprocessing: Make your data meaningful
- SmolML - The utility room!
We believe the best way to truly understand complex topics like machine learning is often to build them yourself. Production libraries are fantastic tools, but their internal complexity and optimizations can sometimes obscure the fundamental principles.
SmolML strips away these layers to focus on the core ideas:
- Every major component is built from scratch, letting you trace the logic from basic operations to complex algorithms.
- See how concepts like automatic differentiation (autograd), optimization algorithms, and model architectures are implemented in code, so that you can implement them yourself.
- Relying only on Python's standard library makes the codebase accessible and easy to explore without external setup hurdles.
- Code is written with understanding, not peak performance, as the primary goal.
In order to learn as much as possible, we recommend reading through the guides, checking the code, and then trying to implement your own versions of these components.
SmolML provides an implementation of the essential building blocks for any Machine Learning library:
- The Foundation: Custom Arrays & Autograd Engine:
  - Automatic Differentiation (Value): A simple autograd engine that tracks operations and computes gradients automatically. (See smolml/core/value.py)
  - N-dimensional Arrays (MLArray): A custom array implementation inspired by NumPy (though simplified), supporting common mathematical operations needed for ML. Extremely inefficient due to being written in Python, but ideal for understanding N-dimensional arrays, one of the most underrated skills of an ML engineer. (See smolml/core/ml_array.py)
- Essential Preprocessing:
  - Scalers (StandardScaler, MinMaxScaler): Fundamental tools to prepare your data, because algorithms tend to perform better when features are on a similar scale. (See smolml/preprocessing/scalers.py)
- Build Your Own Neural Networks:
  - Activation Functions: Non-linearities like relu, sigmoid, softmax, and tanh that allow networks to learn complex patterns. (See smolml/utils/activation.py)
  - Weight Initializers: Smart strategies (Xavier, He) to set initial network weights for stable training. (See smolml/utils/initializers.py)
  - Loss Functions: Ways to measure model error (mse_loss, binary_cross_entropy, categorical_cross_entropy). (See smolml/utils/losses.py)
  - Optimizers: Algorithms like SGD, Adam, and AdaGrad that update model weights based on gradients to minimize loss. (See smolml/utils/optimizers.py)
- Classic ML Models:
  - Regression: Implementations of Linear and Polynomial regression.
  - Neural Networks: A flexible framework for building feed-forward neural networks.
  - Tree-Based Models: Decision Tree and Random Forest implementations for classification and regression.
  - K-Means: KMeans unsupervised clustering algorithm for grouping similar data points together.
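To give a feel for what "tracks operations and computes gradients automatically" means, here is a minimal sketch of a scalar reverse-mode autograd node, used to fit a line with plain gradient descent on a mean squared error. It is an illustration only and not SmolML's actual code: the class name Scalar, its attributes, and the fit_line helper are hypothetical names made up for this example. The real implementation lives in smolml/core/value.py.

```python
class Scalar:
    """Hypothetical scalar autograd node (illustration only, not SmolML's Value API)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0                  # d(output)/d(self), filled in by backward()
        self._parents = parents          # nodes this one was computed from
        self._backward = lambda: None    # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        out = Scalar(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        out = Scalar(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    __radd__ = __add__
    __rmul__ = __mul__

    def __neg__(self):
        return self * -1.0

    def __sub__(self, other):
        return self + (-other)

    def backward(self):
        # Order nodes so each one comes after everything it depends on,
        # then apply the chain rule from the output back to the inputs.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


def fit_line(xs, ys, lr=0.1, steps=500):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = Scalar(0.0), Scalar(0.0)
    for _ in range(steps):
        loss = Scalar(0.0)
        for x, y in zip(xs, ys):
            err = w * x + b - y
            loss = loss + err * err
        loss = loss * (1.0 / len(xs))
        w.grad = b.grad = 0.0            # reset gradients left over from the last step
        loss.backward()
        w.data -= lr * w.grad            # plain gradient descent update
        b.data -= lr * b.grad
    return w.data, b.data


a, b = Scalar(2.0), Scalar(3.0)
c = a * b + a
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Calling fit_line on points sampled from y = 2x + 1 should return values close to 2 and 1. Each operation records how to pass gradients back to its inputs, and backward() replays those records in reverse order, which is the same idea that backpropagation applies to whole networks.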
SmolML is built for learning, and thus it should not be used for production. Being pure Python, it's WAAAAY slower and uses a ton more memory than libraries using optimized C/C++/Fortran backends (like NumPy).
It's best suited for small datasets and toy problems where understanding the mechanics matters more than computation time. For real-world tasks, stick to battle-tested libraries like scikit-learn, PyTorch, TensorFlow, or JAX.
The best way to use SmolML is to clone this repository and explore the code and examples.
git clone https://github.com/rodmarkun/SmolML
cd SmolML
# Explore the code in the smolml/ directory!
You can also run the tests in the tests/ folder. Just install the packages listed in requirements.txt first. They are needed only to compare SmolML against standard libraries such as TensorFlow and scikit-learn and to generate plots with matplotlib; SmolML itself does not use any of them.
cd tests
pip install -r requirements.txt
Contributions are always welcome! If you're interested in contributing to SmolML, please fork the repository and create a new branch for your changes. When you're done, submit a pull request to merge them into the main branch.
If you want to support SmolML, you can:
- Star ⭐ the project on GitHub!
- Donate 🪙 to my Ko-fi page!
- Share ❤️ the project with your friends!