This project applies machine learning regression techniques to model and predict soil pH based on bacterial microbiome composition. The dataset consists of farmland soil samples collected from multiple geographic locations, each characterized by microbial profiles derived from 16S rRNA amplicon sequencing. The study aims to explore the relationship between the microbial community structure and soil health metrics, with pH serving as the primary target variable. Understanding this relationship can contribute to precision agriculture, sustainable land management, and microbiome-driven soil diagnostics.
Objective
To explore the microbial composition of farmland soils using dimensionality reduction and clustering, and to build regression models that predict soil pH from microbiome data. The project compares different machine learning algorithms and evaluates their performance on high-dimensional biological data.
Dataset
Samples: 753 farmland soil samples
Features: 6,798 amplicon sequence variants (ASVs) representing bacterial species/strains
Additional attributes: 12 soil health metrics (including pH, water capacity, etc.)
Target variable: Soil pH
File: soil_health.csv.gz
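As a rough illustration of how the dataset described above might be loaded and split into ASV features and the pH target, here is a minimal sketch. The column names are assumptions: this README does not document the header of soil_health.csv.gz, so the "ASV_" prefix, the "pH" column name, and the helper names load_soil_table and split_features_target are hypothetical. The demo runs on a tiny synthetic table rather than the real file.

```python
import numpy as np
import pandas as pd


def load_soil_table(path="soil_health.csv.gz"):
    """Read the gzipped CSV; pandas infers gzip compression from the .gz suffix."""
    return pd.read_csv(path)


def split_features_target(df, target_col="pH", asv_prefix="ASV_"):
    """Split a sample table into ASV abundance features and the target column.

    ASSUMPTION: ASV columns share a common prefix and the target column is
    named "pH". Adjust both to match the actual file header.
    """
    asv_cols = [c for c in df.columns if str(c).startswith(asv_prefix)]
    return df[asv_cols], df[target_col]


# Tiny synthetic stand-in for the real table (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(0, 50, size=(5, 3)),
    columns=["ASV_1", "ASV_2", "ASV_3"],
)
demo["pH"] = [5.8, 6.4, 7.1, 6.9, 5.5]
demo["water_capacity"] = [0.21, 0.25, 0.19, 0.30, 0.22]

X_demo, y_demo = split_features_target(demo)
print(X_demo.shape, y_demo.name)
```

With the real file, the same split would be applied to the result of load_soil_table(), once the actual column names have been checked.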
Methodology
Data Exploration & Preprocessing
- Analyze microbiome composition and soil pH distribution
- Handle missing data and normalize ASV abundance values
- Apply PCA and unsupervised clustering (e.g., k-means or hierarchical clustering) to visualize community structure

Model Selection & Training
- Implement and compare multiple regression models:
  - Linear Regression / Ridge / Lasso
  - Random Forest Regressor
  - Gradient Boosted Trees (XGBoost / LightGBM)
  - Support Vector Regression (SVR) or Neural Networks (optional)
- Optimize hyperparameters via cross-validation

Evaluation
- Evaluate models using:
  - R² score
  - Mean Absolute Error (MAE)
  - Root Mean Squared Error (RMSE)
- Compare predictive performance and interpret feature importance

Interpretation & Insights
- Identify microbial taxa most correlated with soil pH variation
- Discuss potential biological and ecological implications of results
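The methodology above can be sketched end to end with scikit-learn. This is an illustrative outline under stated assumptions, not the project's actual code: the data are synthetic Poisson counts, the normalization (relative abundance in percent followed by log1p) is one plausible choice among several, only Ridge and Random Forest are compared, and the hyperparameters are fixed rather than tuned. The helper name to_log_relative_abundance is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def to_log_relative_abundance(counts):
    """Convert raw counts to per-sample percentages, then apply log1p.

    ASSUMPTION: this is one common normalization, not necessarily the
    one used in the project.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty samples
    return np.log1p(100.0 * counts / totals)


# Synthetic stand-in data: 120 samples x 40 ASVs (the real table is 753 x 6,798).
rng = np.random.default_rng(42)
counts = rng.poisson(lam=20, size=(120, 40))
features = to_log_relative_abundance(counts)
# Made-up target: pH driven by the contrast between two ASVs plus noise.
ph = 6.5 + 2.0 * (features[:, 0] - features[:, 1]) + rng.normal(0, 0.1, 120)

# Exploration: project samples to 2 principal components, then cluster them.
coords = PCA(n_components=2).fit_transform(features)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

# Model comparison: each pipeline normalizes counts before fitting.
models = {
    "ridge": make_pipeline(
        FunctionTransformer(to_log_relative_abundance),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
    "random_forest": make_pipeline(
        FunctionTransformer(to_log_relative_abundance),
        RandomForestRegressor(n_estimators=50, random_state=0),
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}

results = {}
for name, model in models.items():
    cv_out = cross_validate(model, counts, ph, cv=cv, scoring=scoring)
    results[name] = {
        "r2": cv_out["test_r2"].mean(),
        "mae": -cv_out["test_mae"].mean(),  # flip sign: scorers return negatives
        "rmse": -cv_out["test_rmse"].mean(),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Keeping the normalization inside each pipeline means every cross-validation fold applies the same preprocessing to its own training and test rows. Hyperparameter tuning could be layered on by wrapping a pipeline in GridSearchCV, and the fitted Random Forest's feature_importances_ attribute is one starting point for the feature-importance step.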