- Project Overview
- Learning Journey
- Prerequisites
- Getting Started
- Project Structure
- Understanding the Code
- Running the Project
- External Tools and Services
- Best Practices
- Troubleshooting
- Next Steps
## Project Overview

This project demonstrates how to build, train, and deploy a machine learning model using Amazon SageMaker, AWS's fully managed machine learning service. It implements a mobile phone price classification system using a Random Forest classifier to predict phone price categories based on various features, showcasing a complete end-to-end cloud-based ML workflow.
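The core modeling task can be sketched locally before involving any AWS infrastructure. The snippet below is a minimal, self-contained illustration using synthetic data — the feature values and label rule are made up for demonstration; the real project trains on `train-V-1.csv`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phone dataset (the real features come from
# train-V-1.csv: battery power, RAM, etc.)
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # stand-in "price range" label

# Classic train/test split, then fit a Random Forest classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

SageMaker's value is running exactly this kind of workload on managed cloud infrastructure instead of your laptop.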
This project serves as a practical guide to moving beyond local machine learning. You will see how to leverage AWS's powerful infrastructure to orchestrate data preparation, model training, and deployment, all from within a familiar Jupyter Notebook environment. SageMaker handles the heavy lifting of infrastructure, allowing you to focus on the ML logic.
## Learning Journey

This project is designed to take you through a progressive learning experience:

- **Foundation Level**: Understand the basic structure of a SageMaker training script (`script.py`) and how the SageMaker Python SDK orchestrates jobs.
- **Intermediate Level**: Learn about AWS IAM roles for security, S3 bucket management for data storage, and how SageMaker executes training jobs on dedicated cloud instances.
- **Advanced Level**: Explore model deployment, creating real-time inference endpoints, and managing production-ready ML workflows.
## Prerequisites

Before diving in, you should have:

**Technical Knowledge:**

- Basic understanding of Python programming.
- Familiarity with pandas and scikit-learn.
- Foundational machine learning concepts (classification, training/testing splits).
- Basic knowledge of command-line operations.

**AWS Requirements:**

- An active AWS account with permissions to manage SageMaker, S3, and IAM roles.
- AWS CLI installed and configured locally.

**Development Environment:**

- Python 3.8 or higher.
- Jupyter Notebook or JupyterLab.
## Getting Started

First, clone the repository and navigate to the project directory:

```bash
git clone https://github.com/GoJo-Rika/AWS-SageMaker-ML-Project.git
cd AWS-SageMaker-ML-Project
```

We recommend using `uv`, a fast, next-generation Python package manager, for setup.
- Install `uv` on your system if you haven't already:

  ```bash
  # On macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows
  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```

- Create a virtual environment and install dependencies with a single command:

  ```bash
  uv sync
  ```

  This command creates a `.venv` folder and installs all packages from `requirements.txt`.

Note: For a comprehensive guide on `uv`, you can visit this detailed tutorial: uv-tutorial-guide.
If you prefer to use the standard venv and pip:

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use: venv\Scripts\activate
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Create a `.env` file by copying the sample template. This file will store your AWS-specific configuration.

```bash
cp .env.sample .env
```

Now, edit the `.env` file with your details:

- `AWS_S3_BUCKET_NAME`: A globally unique S3 bucket name for storing your data and models.
- `AWS_SAGEMAKER_ROLE`: The ARN of an IAM role with SageMaker execution permissions.

The S3 bucket acts as your data storage layer, while the SageMaker role grants your training jobs the necessary permissions to access resources like S3.
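The notebook needs these two values at runtime. A minimal, stdlib-only sketch of loading them is shown below — the real project may well use a library such as python-dotenv instead, and the `load_env` helper here is purely illustrative:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines, skips blanks and '#' comments.
    Existing environment variables are not overwritten."""
    env_file = Path(path)
    if not env_file.exists():
        return  # nothing to load
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
bucket_name = os.environ.get("AWS_S3_BUCKET_NAME")
sagemaker_role = os.environ.get("AWS_SAGEMAKER_ROLE")
```

Keeping the values in the environment (rather than hardcoded in the notebook) is what makes the project safe to commit and share.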
## Project Structure

The project is organized to separate orchestration, training logic, and data.
```
aws-sagemaker/
├── .env.sample        # Template for environment variables
├── research.ipynb     # Jupyter notebook to orchestrate the entire workflow
├── script.py          # Core training and inference logic for SageMaker
├── requirements.txt   # Python dependencies
├── train-V-1.csv      # Training dataset
├── test-V-1.csv       # Testing dataset
└── README.md          # This documentation
```
- `research.ipynb`: This is the main control center. You will execute cells here to upload data, start the SageMaker training job, and deploy the model.
- `script.py`: This script contains the pure ML code that SageMaker runs on a cloud instance. The notebook submits this script for execution.
## Understanding the Code

The Jupyter notebook uses the SageMaker Python SDK, a high-level library for interacting with AWS services. The key component is the `SKLearn` estimator:

```python
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point="script.py",
    role=AWS_SAGEMAKER_ROLE,
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version="1.2-1",
    hyperparameters={"n_estimators": 100, "random_state": 0},
)
```

This code block tells SageMaker to:

- Run the code in `script.py`.
- Use the specified IAM `role` for permissions.
- Launch one `ml.m5.large` machine for the job.
- Pass `n_estimators` and `random_state` as hyperparameters to the script.
The heart of the ML logic resides in `script.py`.

**Argument Parsing**: The script uses `argparse` to receive hyperparameters and essential paths from the SageMaker environment. SageMaker automatically injects environment variables like `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`, which the script accesses.

```python
# Hyperparameters sent from the notebook
parser.add_argument("--n_estimators", type=int, default=100)

# Special paths provided by the SageMaker environment
parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
```

**Model Persistence**: The script saves the trained model to a specific directory provided by SageMaker. SageMaker then automatically packages this artifact and uploads it to S3.

```python
# The path /opt/ml/model is mapped to the SM_MODEL_DIR environment variable
model_path = Path(args.model_dir) / "model.joblib"
joblib.dump(model, model_path)
```

**Inference Functions**: The `model_fn` is a hook for SageMaker's inference service. When you deploy the model, SageMaker calls this function to load the model into memory.

```python
def model_fn(model_dir: str):
    # Load the model from disk
    clf = joblib.load(Path(model_dir) / "model.joblib")
    return clf
```

For standard scikit-learn models, this is often the only function needed. For more complex cases, you can also implement `input_fn`, `predict_fn`, and `output_fn` to control data pre-processing and post-processing at the endpoint.
## Running the Project

The intended way to run this project is to execute the cells in `research.ipynb` sequentially. The notebook will guide you through:

- Setting up the SageMaker session and S3 bucket.
- Exploring the dataset.
- Splitting the data and uploading it to S3.
- Defining the `SKLearn` estimator.
- Launching the training job with a call to `.fit()`.
- Deploying the trained model to a real-time endpoint with `.deploy()`.
- Making predictions with the endpoint.
- Cleaning up by deleting the endpoint.
Before running on SageMaker, you can test `script.py` locally to catch bugs quickly. This command simulates the SageMaker environment by providing local paths as arguments.

```bash
# Create a dummy model output directory
mkdir -p model_output

# Run the script, pointing the train/test channels to the current directory
python script.py --train . --test . --model-dir ./model_output
```

This helps ensure the script runs without syntax errors before incurring cloud costs.
## External Tools and Services

SageMaker is an ML platform that simplifies the machine learning lifecycle.

- **Training Jobs**: Managed compute instances that execute your `script.py` with the specified data.
- **Model Artifacts**: The `model.tar.gz` file that SageMaker creates and versions in S3.
- **Endpoints**: A fully managed, scalable HTTP endpoint for real-time model inference.
S3 is the central data repository. In this project, it stores the raw training/testing datasets and the final model artifacts generated by the SageMaker training job.
IAM roles are crucial for security. The role specified in your .env file must grant SageMaker permissions to read from your S3 bucket, write logs to CloudWatch, and create training jobs and endpoints.
## Best Practices

- **Separation of Concerns**: The notebook handles orchestration, while `script.py` handles the core ML logic. This makes the code modular and reusable.
- **Environment Management**: Using a `.env` file prevents hardcoding secrets and makes the project portable.
- **Data Versioning**: The project uses specific file names (`train-V-1.csv`), highlighting the importance of versioning both code and data.
- **Reproducibility**: Setting a `random_state` ensures that the model's results are consistent across runs, which is critical for debugging and comparison.
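The reproducibility point is easy to verify directly: two forests trained separately with the same `random_state` produce identical predictions. A quick check on synthetic data (the data here is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real project trains on train-V-1.csv
rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = (X.sum(axis=1) > 2).astype(int)

# Two independently trained forests with the same random_state
preds_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y).predict(X)
preds_b = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y).predict(X)

identical = bool((preds_a == preds_b).all())  # True: runs are reproducible
```

Without a fixed `random_state`, bootstrap sampling and feature selection differ between runs, so two training jobs on identical data can yield slightly different models.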
## Troubleshooting

- **Permissions Errors**: `AccessDeniedException` errors usually mean your `AWS_SAGEMAKER_ROLE` is missing permissions. Ensure it has `AmazonSageMakerFullAccess` and `S3FullAccess` (or more restrictive policies) for your project bucket.
- **Data Loading Failures**: Check the CloudWatch logs for the training job. `FileNotFoundError` often means the S3 path in the `.fit()` call is incorrect or the data wasn't uploaded properly.
- **Endpoint Failures**: If the endpoint fails to deploy or returns errors, check the CloudWatch logs for the endpoint. This can indicate an issue in your `model_fn` or other inference functions in `script.py`.
## Next Steps

- **Add Model Evaluation**: Enhance `script.py` to save evaluation metrics (such as the classification report) as a JSON file in the model directory, making them easily accessible artifacts.
- **Implement Cross-Validation**: Add k-fold cross-validation inside `script.py` for more robust model evaluation.
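The model-evaluation idea could be sketched as a small helper in `script.py`. The `save_metrics` name and `evaluation.json` filename below are hypothetical choices — anything written to the model directory gets packaged into `model.tar.gz` alongside `model.joblib`:

```python
import json
from pathlib import Path
from sklearn.metrics import classification_report

def save_metrics(y_true, y_pred, model_dir):
    """Write the classification report next to the model artifact so that
    SageMaker packages it into model.tar.gz with the model itself."""
    report = classification_report(y_true, y_pred, output_dict=True)
    out_path = Path(model_dir) / "evaluation.json"
    out_path.write_text(json.dumps(report, indent=2))
    return out_path
```

Calling this after the test-set evaluation in the training script makes per-class precision/recall and overall accuracy retrievable from S3 without rerunning anything.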
- **Hyperparameter Tuning**: Use SageMaker's built-in automatic hyperparameter tuning capabilities to find the optimal `n_estimators`.
- **SageMaker Pipelines**: Re-architect the project into a SageMaker Pipeline for a fully automated, multi-step MLOps workflow.
- **Model Monitoring**: Implement data drift and model quality monitoring on the deployed endpoint to detect performance degradation over time.
- **Explore Different Algorithms**: Experiment with other scikit-learn algorithms or deep learning frameworks such as TensorFlow or PyTorch.
- **Distributed Training**: Learn about SageMaker's distributed training capabilities for handling larger datasets.
- **MLOps Integration**: Investigate SageMaker Pipelines for creating end-to-end ML workflows with automated testing and deployment.
This project provides a solid foundation for understanding cloud-based machine learning workflows. By working through each component systematically, you'll develop the skills necessary to build and deploy production-ready ML systems on AWS. Remember that mastering cloud ML is an iterative process—start with the basics, experiment with different approaches, and gradually incorporate more advanced features as your understanding deepens.