This project transforms MariaDB into an intelligent AI/ML platform by introducing Jupyter Magic Commands that seamlessly integrate vector operations, Retrieval-Augmented Generation (RAG) pipelines, and machine learning models directly with database workflows. Built entirely in Google Colab, it bridges the gap between traditional SQL analytics and modern AI inference, empowering developers and data scientists to perform complex AI tasks using intuitive, SQL-like commands within a familiar Jupyter environment.
%%mariadbMagic Command: Executes SQL queries directly from Jupyter and returns results as Pandas DataFrames.- Automatic Visualisation: Generates charts using Matplotlib or Seaborn for numeric query results.
%%mariadb_vectorExtension: Integrates Hugging Face embeddings with MariaDB Vector columns for semantic and similarity search.- RAG (Retrieval-Augmented Generation) Pipeline: Enables context-aware AI responses using database-driven document retrieval.
- Google Colab Support: Complete setup and execution directly in Colab, requiring no local installation.
- Reusable Module: Packaged code for quick integration in any AI or data workflow.
This project uses the OpenFlights dataset as the primary data source.
You can load it in Colab with:
!wget https://raw.githubusercontent.com/MariaDB/openflights/main/airports.dat -O airports.csv
!wget https://raw.githubusercontent.com/MariaDB/openflights/main/routes.dat -O routes.csvThese datasets include information about airports, airlines, and global routes — perfect for demonstrating data querying, visualization, and vector-based similarity search (e.g., “Find airports similar to Delhi”).
-
Install MariaDB and Dependencies
!apt-get update !apt-get install -y mariadb-server !pip install mariadb sqlalchemy pandas matplotlib seaborn transformers !service mysql start
-
Create and Load Database
import pandas as pd, mariadb conn = mariadb.connect(user='root', password='', database='test_db') cursor = conn.cursor() cursor.execute("CREATE DATABASE IF NOT EXISTS test_db;")
-
Run SQL Queries via Magic Command
%%mariadb SELECT Country, COUNT(*) AS AirportCount FROM airports GROUP BY Country ORDER BY AirportCount DESC LIMIT 10;
-
Visualize Results
import matplotlib.pyplot as plt df.plot(kind='bar', x='Country', y='AirportCount') plt.title('Top 10 Countries by Number of Airports') plt.show()
Use %%mariadb_vector to generate and store embeddings from text fields (e.g., airport names or cities) using the Hugging Face model sentence-transformers/all-MiniLM-L6-v2, enabling semantic similarity and RAG-based querying.
This directory includes a few sample datasets to get you started:
-
california_housing_data*.csv– California housing data from the 1990 US Census More info: Census Dataset Documentation -
mnist_*.csv– A small sample of the MNIST handwritten digits dataset Description: Yann LeCun’s MNIST Page -
anscombe.json– Contains a copy of Anscombe’s Quartet Source: Anscombe, F. J. (1973). Graphs in Statistical Analysis. American Statistician, 27(1): 17–21. JSTOR 2682899. Prepared via thevega_datasetslibrary.
A ready-to-use Colab demo notebook includes:
- Environment setup
- Dataset loading
- Example SQL queries and plots
- AI/Vector and RAG integration demos
This project is open-source and released under the MIT License.