SpamShield is a lightweight, production-style machine-learning system for classifying SMS text messages as spam or ham (not spam). It includes a Python-based model training pipeline, a FastAPI prediction service, and a secure, signed REST API for remote inference.
SpamShield combines classical ML techniques with modern deployment practices to demonstrate an end-to-end machine learning lifecycle:
- Model training uses the UCI SMS Spam Collection Dataset. Texts are vectorized using `TfidfVectorizer` and classified using a tuned `LogisticRegression` model.
- Data management is automated via a KaggleHub download and preprocessing utility.
- Evaluation metrics include F1 score, precision, recall, and average precision (AUC-PR).
- Model packaging exports reproducible `.joblib` files with integrity verification (SHA-256 hashes and metadata).
- API service exposes `/predict`, `/health`, `/ready`, and `/metrics` endpoints under FastAPI, with request-level Prometheus metrics and optional HMAC authentication.
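
The integrity check mentioned for model packaging can be sketched roughly as follows. This is a minimal illustration using only the standard library, not SpamShield's actual packaging code: the `metadata.json` file name, its fields, and the helper names are assumptions made for the example, and the "model" is a stand-in byte string rather than a real `.joblib` artifact.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(model_path: Path, version: str) -> Path:
    """Write a sidecar JSON file recording the artifact hash (hypothetical layout)."""
    meta_path = model_path.parent / "metadata.json"
    meta = {
        "version": version,
        "artifact": model_path.name,
        "sha256": sha256_of_file(model_path),
    }
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


def verify_artifact(model_path: Path, meta_path: Path) -> bool:
    """Recompute the artifact hash and compare it with the recorded one."""
    meta = json.loads(meta_path.read_text())
    return sha256_of_file(model_path) == meta["sha256"]


def demo() -> tuple[bool, bool]:
    """Package stand-in model bytes, verify them, then corrupt the file and re-verify."""
    with tempfile.TemporaryDirectory() as tmp:
        model_path = Path(tmp) / "model.joblib"
        model_path.write_bytes(b"stand-in for serialized model bytes")
        meta_path = write_metadata(model_path, "v1.0.0")
        ok_before = verify_artifact(model_path, meta_path)
        model_path.write_bytes(b"tampered")
        ok_after = verify_artifact(model_path, meta_path)
    return ok_before, ok_after


if __name__ == "__main__":
    print(demo())
```

A service would typically run a check like `verify_artifact` before loading the model and refuse to start if the hashes differ.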
All incoming requests can be optionally validated using HMAC signatures. Each client signs its requests as:

    signature = hmac.new(
        secret.encode("utf-8"),
        f"{method}\n{path}\n{timestamp}\n{sha256(body)}\n{api_key}".encode("utf-8"),
        hashlib.sha256
    ).hexdigest()

The server recomputes this signature to verify authenticity and ensure the payload was not tampered with.
Note: This approach is designed for machine-to-machine integrity verification. For user-level authentication, API key management, or OAuth2-style login flows, consider integrating a provider or framework such as:
- FastAPI Users for JWT-based user registration and authentication
- Auth0, AWS Cognito, or Supabase Auth for managed identity and token-based authorization
- Combining HMAC signing with authenticated API keys for hybrid setups where both integrity and identity are important
- Prometheus metrics exposed at `/metrics`:
  - `api_requests_total`: per-route request counts
  - `request_latency_seconds`: end-to-end latency histogram
  - `model_inference_seconds`: inference-only timing
  - `request_payload_bytes`: incoming payload size distribution
- JSON-structured logging with request IDs for traceability
SpamShield is containerized and can be deployed on AWS ECS or any environment supporting Docker.
Requirements:
- Python 3.14+
- uv (Astral’s Python package manager)
- Docker (for containerized runtime)
Ensure that you have a trained, versioned model available in the runtime `models/` directory, then build and run the image:

    docker build --build-arg SPAMSHIELD_MODEL_VERSION=v1.0.0 -t spamshield:v1.0.0 .
    docker run -p 8080:8080 spamshield:v1.0.0

While designed to mimic production environments, SpamShield is intentionally simple and has a few limitations worth improving:

| Area | Limitation | Potential Improvement |
|---|---|---|
| Modeling | Classical logistic regression only. No contextual NLP | Experiment with transformer-based embeddings (e.g., distilbert-base-uncased) |
| Dataset | Limited to small SMS dataset | Add multilingual datasets and larger email/text messages |
| Thresholding | Static threshold stored in metadata | Implement dynamic calibration or per-user thresholds |
| Authentication | HMAC keys stored as plain environment vars | Integrate AWS Secrets Manager / KMS rotation |
| Scalability | Single model instance in memory | Add model caching & autoscaling with ECS target tracking |
| Monitoring | Basic Prometheus histograms only | Include inference-level metrics and model drift detection |

- Create the dataset: `uv run create-spam-dataset --output data/spam.csv`
- Train the model: `uv run train-spam-model --dataset data/spam.csv --version v1.0.0 --plots`
- Move the model into the API runtime model directory: `mv models/v1.0.0 src/spamshield/api/models`
- Update `.env.dev` to use the correct model version: `SPAMSHIELD_MODEL_VERSION="v1.0.0"`
- Run the API: `uv run fastapi dev src/spamshield/api/main`
- Send a prediction request: `uv run scripts/request.py -u http://localhost:8000 -m "Click here for free cash!"`

Example response:

    { "label": "spam", "score": 0.9823 }
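
For callers that prefer Python over the bundled script, a prediction request might be built and parsed roughly as below. This is a sketch under assumptions: the request body field name (`message`) is a guess based on the script's `-m` flag, the helper names are hypothetical, and no HMAC headers are attached because the header names are not specified here. Only the `/predict` route and the `label`/`score` response fields come from the text above.

```python
import json
import urllib.request


def build_predict_request(base_url: str, message: str) -> urllib.request.Request:
    """Build a POST request for /predict (body field name is an assumption)."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw: bytes) -> tuple[str, float]:
    """Extract (label, score) from a response shaped like the example above."""
    payload = json.loads(raw)
    label = payload["label"]
    score = float(payload["score"])
    if label not in {"spam", "ham"}:
        raise ValueError(f"unexpected label: {label!r}")
    return label, score


def predict(base_url: str, message: str, timeout: float = 5.0) -> tuple[str, float]:
    """Send the request to a running service and return the parsed prediction."""
    request = build_predict_request(base_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read())
```

With the API running locally, `predict("http://localhost:8000", "Click here for free cash!")` would return a `(label, score)` tuple, provided the assumed body field matches what the service expects.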