The isoForest package is a simple implementation of the Isolation Forest algorithm for outlier detection. It uses the ranger package to build the forests, and it also provides visualizations of outliers to help inspect the prediction results.
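As background, the original Isolation Forest paper (Liu, Ting and Zhou) turns a point's average path depth into an anomaly score with s = 2^(-E[h] / c(n)), where c(n) is the expected path length of an unsuccessful binary-search-tree lookup among n points. The sketch below implements that textbook formula in base R. The helper names are hypothetical, and treating n as the 150 rows of iris is an assumption about how isoForest samples, not something taken from its source.

```r
# Textbook Isolation Forest scoring (Liu et al.); not copied from isoForest.

# Expected path length of an unsuccessful BST search among n points.
# Uses the usual approximation H(i) ~ log(i) + Euler-Mascheroni constant.
average_path_length <- function(n) {
  euler_gamma <- 0.5772156649
  2 * (log(n - 1) + euler_gamma) - 2 * (n - 1) / n
}

# Map an average depth to a score in (0, 1); shallower depth -> higher score.
anomaly_score <- function(average_depth, n) {
  2^(-average_depth / average_path_length(n))
}

# Assumed sample size: the 150 rows of iris.
score <- anomaly_score(average_depth = 7.81, n = 150)
print(round(score, 3))
```

With a depth of 7.81 and n = 150 this gives about 0.554, which agrees with the first row of the example output in this README. That suggests, without proving, that the package follows the same normalization.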
# Development version
devtools::install_github("flystar233/isoForest")

library(isoForest)
result <- isoForest(iris)
head(result$scores)
# id average_depth anomaly_score
# <int> <dbl> <dbl>
#1 1 7.81 0.554
#2 2 7.82 0.554
#3 3 7.71 0.559
#4 4 7.69 0.559
#5 5 7.78 0.556
#6 6 7.39 0.572

The feature_contribution() function helps you understand which features contribute most to a sample's anomaly score. This is crucial for interpreting anomaly detection results and understanding why certain samples are flagged as outliers. Two methods are available:
- Path-based analysis (default): Analyzes decision paths in isolation trees to determine feature importance
- Permutation importance: Measures how much each feature affects the anomaly score when its values are randomly permuted
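To make the permutation idea concrete, here is a minimal base R sketch for a single sample. It is not the package's implementation: the scorer is a stand-in (Mahalanobis distance rather than an isolation forest), and `score_fn` and `permutation_contribution` are hypothetical names introduced only for this illustration.

```r
# Stand-in scorer (assumption): Mahalanobis distance, NOT an isolation forest.
score_fn <- function(x, ref) {
  mahalanobis(x, center = colMeans(ref), cov = cov(ref))
}

# For one row, replace each feature in turn with values drawn at random from
# that feature's column and record the mean absolute change in the score.
permutation_contribution <- function(ref, row_id, n_permutations = 50, seed = 1) {
  set.seed(seed)
  base_score <- score_fn(ref[row_id, , drop = FALSE], ref)
  delta <- sapply(names(ref), function(feature) {
    changes <- replicate(n_permutations, {
      x <- ref[row_id, , drop = FALSE]
      x[[feature]] <- sample(ref[[feature]], 1)
      abs(score_fn(x, ref) - base_score)
    })
    mean(changes)
  })
  # Express each feature's effect as a percentage of the total effect.
  round(100 * delta / sum(delta), 1)
}

contrib <- permutation_contribution(iris[1:4], row_id = 42)
print(contrib)
```

Features whose replacement moves the score the most receive the largest share. The cost grows with the number of features times `n_permutations`, which is why a permutation approach is slower than reading contributions off tree paths.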
# Train isolation forest
model <- isoForest(iris[1:4])
# Analyze feature contributions for anomalous samples
contributions <- feature_contribution(model, data = iris[1:4])
print(contributions)
# Analyze specific samples
contributions <- feature_contribution(model,
                                      sample_ids = c(42, 107, 119),
                                      data = iris[1:4])
print(contributions)

# Path-based analysis (default, faster)
path_contributions <- feature_contribution(model,
                                           sample_ids = c(1, 50),
                                           data = iris[1:4],
                                           method = "path")
# Permutation importance (more accurate but slower)
perm_contributions <- feature_contribution(model,
                                           sample_ids = c(1, 50),
                                           data = iris[1:4],
                                           method = "permutation",
                                           n_permutations = 50)

The function returns contribution percentages showing how much each feature contributes to the anomaly score:
# Example output:
# Sample 42 | Score: 0.723
# Petal.Length: 45.2%
# Petal.Width: 32.1%
# Sepal.Length: 15.4%
# Sepal.Width: 7.3%

The package provides multiple methods for setting anomaly detection thresholds. Instead of choosing a threshold by hand, you can use statistical and geometric methods to determine one automatically from the score distribution.
| Method | Description | Best For |
|---|---|---|
| contamination | Set threshold based on expected outlier proportion | Known anomaly rate |
| quantile | Use a specific quantile as threshold | Percentile-based detection |
| iqr | Interquartile range (Q3 + 1.5×IQR) | Box-plot style analysis |
| zscore | Z-score based (mean + 2×sd) | Normal distributions |
| mad | Median Absolute Deviation | Robust, symmetric distributions |
| kde_weighted | KDE-weighted mean (density-weighted robust mean) | Heavy tails, extreme outliers |
| mtt | Modified Thompson Tau test | Small to medium samples |
| manual | User-specified threshold | Custom requirements |
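The simpler rules in the table can be written out directly in base R. The sketch below applies four of them to a toy score vector made up for illustration. The IQR and z-score formulas follow the table; the contamination rule as a plain upper quantile and the MAD rule as median + 3 × MAD are assumptions about how the package defines them.

```r
# Toy anomaly scores (illustrative values only, not package output).
scores <- c(0.42, 0.45, 0.47, 0.48, 0.50, 0.51, 0.52, 0.55, 0.58, 0.75)

thresholds <- c(
  # Flag the top 10% of scores (assumed reading of "contamination").
  contamination = unname(quantile(scores, 1 - 0.10)),
  # Box-plot rule from the table: Q3 + 1.5 * IQR.
  iqr = unname(quantile(scores, 0.75)) + 1.5 * IQR(scores),
  # Z-score rule from the table: mean + 2 * sd.
  zscore = mean(scores) + 2 * sd(scores),
  # Assumed MAD rule: median + 3 * MAD (stats::mad uses the 1.4826 scaling).
  mad = median(scores) + 3 * mad(scores)
)

# Number of scores strictly above each threshold.
flagged <- sapply(thresholds, function(threshold) sum(scores > threshold))
print(round(thresholds, 4))
print(flagged)
```

On this toy vector every rule isolates the single extreme score. On real data the rules can disagree, which is the reason to compare several of them before settling on one.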
# Train model
library(isoForest)
model <- isoForest(iris[1:4])
# Method 1: Contamination-based (most common)
result <- set_anomaly_threshold(model, method = "contamination", contamination = 0.05)
print(result)
# Get anomalous samples
anomalies <- iris[result$predictions$is_anomaly, ]
head(anomalies)

For data with extreme outliers or heavy-tailed distributions:
# KDE-weighted method (density-aware, highly robust)
result_kde <- set_anomaly_threshold(model, method = "kde_weighted", kde_multiplier = 3)
# MAD method (robust and fast)
result_mad <- set_anomaly_threshold(model, method = "mad", mad_multiplier = 3)
# Compare results
cat("KDE-weighted detected:", sum(result_kde$predictions$is_anomaly), "anomalies\n")
cat("MAD detected:", sum(result_mad$predictions$is_anomaly), "anomalies\n")

For small to medium sample sizes with statistical guarantees:
# Modified Thompson Tau test
result_mtt <- set_anomaly_threshold(
  model,
  method = "mtt",
  mtt_alpha = 0.05,  # Significance level
  mtt_max_iter = 30  # Maximum iterations
)
# Adjust sensitivity
result_strict <- set_anomaly_threshold(model, method = "mtt", mtt_alpha = 0.01) # More conservative
result_loose <- set_anomaly_threshold(model, method = "mtt", mtt_alpha = 0.10) # More sensitive

The package provides flexible visualization tools to understand how anomalies differ from normal data across features.
When you want to understand why a specific sample is anomalous:
# Calculate feature contributions
model <- isoForest(iris[1:4])
contributions <- feature_contribution(model, sample_ids = 42, data = iris[1:4])
# Single boxplot view (shows top contributing features)
plot_anomaly_boxplot(contributions, iris[1:4], sample_id = 42, top_n = 5)
# Faceted view (better for many features)
plot_anomaly_boxplot_faceted(contributions, iris[1:4], sample_id = 42, top_n = 8)

When you want to see where all detected anomalies fall in the feature distributions:
# Detect anomalies using threshold
data <- read.csv('test.csv')
model2 <- isoForest(data)
result <- set_anomaly_threshold(model2, method = "mtt", mtt_alpha = 0.05)
anomaly_ids <- which(result$predictions$is_anomaly)
# Visualize all anomalies at once
plot_anomaly_boxplot(
  contribution_obj = NULL,  # No contribution object needed
  data = data,
  sample_id = anomaly_ids   # Can be a vector of IDs
)
# Faceted view (recommended for multiple features)
plot_anomaly_boxplot_faceted(
  contribution_obj = NULL,
  data = data,
  sample_id = anomaly_ids,
  top_n = NULL              # Show all features
)
For high-dimensional data (>4 features), visualize anomalies in 2D using dimensionality reduction:
# model2 and data are the objects created in the previous example
# PCA projection (fast, interpretable)
plot_anomaly_projection(model2, data, dim_reduction = "pca")
# UMAP projection (better for non-linear patterns, requires 'umap' package)
plot_anomaly_projection(model2, data, dim_reduction = "umap")
# Compare both methods side-by-side (requires 'umap' and 'gridExtra' packages)
plot_anomaly_projection_all(model2, data)
Features:
- Anomalies highlighted in red, normal points in blue
- Smart sampling for large datasets (preserves all anomalies)
- Adjust sampling: sample_rate = 0.05 (default, anomalies = 5% of display)
See ?plot_anomaly_projection for more details.
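For readers who want to see what a PCA projection of anomalies involves, the following base R sketch projects the four iris measurements onto the first two principal components and colours flagged points red. It does not call the package: the anomaly flags come from a stand-in score (Mahalanobis distance, top 5%), which is an assumption made only to keep the example self-contained.

```r
features <- iris[1:4]

# Stand-in anomaly score (assumption): Mahalanobis distance, not isoForest.
scores <- mahalanobis(features, colMeans(features), cov(features))
is_anomaly <- scores > quantile(scores, 0.95)

# Project the standardized features onto the first two principal components.
pca <- prcomp(features, center = TRUE, scale. = TRUE)
projection <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  is_anomaly = is_anomaly
)

# Anomalies in red, normal points in blue.
plot(projection$PC1, projection$PC2,
     col = ifelse(projection$is_anomaly, "red", "blue"),
     pch = 19, xlab = "PC1", ylab = "PC2",
     main = "PCA projection (stand-in scores)")
```

PCA is linear, so each axis is a weighted combination of the original features and can be read from `pca$rotation`. That is the sense in which a PCA view is easier to interpret than a non-linear embedding such as UMAP.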
result <- isoForest(iris[1:2])
plot_anomaly_basic(result, iris[1:2], plot_type = "heatmap")