From 5dc61fc0cd3c2f7143bc67125bae5cd800cb776d Mon Sep 17 00:00:00 2001 From: burkohs1 Date: Fri, 27 Mar 2026 17:31:37 -0400 Subject: [PATCH] draft of basic_demo --- vignettes/.gitignore | 2 + vignettes/basic_demo.Rmd | 339 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 341 insertions(+) create mode 100644 vignettes/.gitignore create mode 100644 vignettes/basic_demo.Rmd diff --git a/vignettes/.gitignore b/vignettes/.gitignore new file mode 100644 index 0000000..097b241 --- /dev/null +++ b/vignettes/.gitignore @@ -0,0 +1,2 @@ +*.html +*.R diff --git a/vignettes/basic_demo.Rmd b/vignettes/basic_demo.Rmd new file mode 100644 index 0000000..9bda0d8 --- /dev/null +++ b/vignettes/basic_demo.Rmd @@ -0,0 +1,339 @@ +--- +title: "Vignette1: basic_demo" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{basic_demo} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + echo = FALSE, # hide code + message = FALSE, # hide messages + warning = FALSE, # hide warnings + fig.width = 12, + fig.height = 7, + out.width = "100%" ) +``` + +```{r} +library(gsClusterDetect) +suppressWarnings(suppressMessages(library(data.table))) +data("example_count_data", package = "gsClusterDetect") +data("spline_01", package = "gsClusterDetect") +data("counties", package = "gsClusterDetect") +``` + +## Overview: Package Purpose + +Health departments commonly seek clusters of cases in space and time, as cases occurring within similar time frames in nearby locations could signal outbreaks or other events of public health importance. The data intended for this package are typically counts of outcomes for syndromic categories such as emergency department visits for selected infectious diseases, overdoses, or other harms. However, the package may also be applied to seek clusters in other types of count data whose observations have space and time descriptors.
The following paragraphs briefly summarize the cluster-finding method. + +The scan statistic is a widely used method for detecting clusters. For a given outcome, the method works by testing many candidate regions for high observed counts of that outcome and identifying regions with unusually high counts relative to expected levels. In many applications including the current one, expected counts are derived from a baseline set of historical data. The most anomalous regions are then evaluated for statistical significance and potential alerting. This approach is implemented in the SaTScan software (www.satscan.org), which uses a moving cylindrical window over space and time. The base of each cylinder defines a geographic area, while its height represents a time interval. For each cylinder, the observed number of cases is compared with the expected count using a test statistic. The cylinder with the maximum value of the scan statistic defines a candidate cluster. + +Because the distribution of the maximum scan statistic is not available in closed form, SaTScan estimates statistical significance using Monte Carlo simulation. That is, it repeatedly generates simulated datasets under a null model and compares the observed cluster to this reference distribution. The approach implemented in this package replaces these repeated simulations with an immediate significance assessment based on spline functions trained on a large set of archived SaTScan results. The user chooses a spline corresponding to a significance level; this spline defines a threshold for each possible observed count. For each candidate cluster, an alert is issued if its observed count exceeds the corresponding threshold. This method reduces computation time while providing a practical approximation to the standard SaTScan significance test, enabling rapid, on-demand cluster detection across outcomes and time scales. 
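To make the test statistic concrete, here is a minimal sketch (illustration only, not the package's internal code) of the Poisson log-likelihood-ratio statistic for a single cylinder, written in terms of its observed and expected counts:

```r
# Toy Poisson scan statistic for one space-time cylinder (illustration only):
# the log-likelihood ratio comparing observed cases inside the cylinder (c_obs)
# with the expected count derived from baseline data (c_exp).
scan_stat <- function(c_obs, c_exp) {
  if (c_obs <= c_exp) return(0)  # only elevated counts are of interest
  c_obs * log(c_obs / c_exp) - (c_obs - c_exp)
}

# A cylinder with 12 observed cases against 5 expected scores higher
# than one with 7 observed against the same expectation.
scan_stat(12, 5)
scan_stat(7, 5)
```

In a full scan, this quantity is evaluated for every candidate cylinder, and the cylinder with the maximum value defines the candidate cluster.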
The approach is faster than using conventional repeated trials and provides epidemiologists with a proxy method to obtain clusters on demand, for any outcome, and of any duration. + +Requirements for this scan statistic implementation are: + +- assignment of the observed data to geographic regions with centroid point locations, so that distance calculations can enable expanding cylinder scans, +- observed counts of the data in each region for each time step, +- corresponding expected counts of the data, estimated from baseline data, +- distances between the regions. + +## Package functionality + +The current package provides utilities for cluster detection on location-date-count surveillance data. It supports: + +- building location distance structures (matrix and sparse list forms), +- computing baseline/test summaries and visualization-ready outputs, +- running a full clustering workflow through `find_clusters()`, +- using lower-level functions when stepwise control is needed, +- simulating cluster injections for testing workflows. + +This vignette demonstrates all exported functions, with emphasis on the recommended wrapper: `find_clusters()`. + +## Technical Requirements to Enable Cluster Detection in User Data + +`find_clusters()` expects a data frame with at least these columns: + +- `location`: character location identifier, +- `date`: `Date` (or `IDate`) values, +- `count`: non-negative counts. + +Dates and locations with zero counts need not be provided; for many outcomes, counts are zero for the majority of date/location pairs. The package code accounts for all date/location pairs internally. + +```{r, echo = TRUE} +cases <- data.table::as.data.table(example_count_data) +locs <- unique(cases$location)[1:12] +cases <- cases[location %in% locs] +cases <- cases[date >= (max(date) - 120)] +cases[, location := as.character(location)] + +head(cases) +``` + +The code above restricts the data to 12 locations for concise demonstration.
Using the full dataset would not noticeably impact computation speed. + +Date helper functions: + +```{r, echo=TRUE} +detect_date <- max(cases$date) +get_test_dates(detect_date, test_length = 7) +## For reference, but not usually needed in application: +head(get_baseline_dates(detect_date, test_length = 7, baseline_length = 90, guard = 0)) +``` + +## Distance Structures for Geographic Scanning + +The scanning process requires distances between locations to enable systematic scanning of expanding cylinders in space. This package provides two ways to supply these distances: + +### Full Distance Matrix Functions + +The functions `zip_distance_matrix()` and `county_distance_matrix()` provide full pairwise distance matrices, which can be large depending on the number of locations in the user data: + +```{r, echo=TRUE} +zip_dm <- zip_distance_matrix("DC") +dim(zip_dm$distance_matrix) + +county_dm <- county_distance_matrix("RI", source = "tigris") +dim(county_dm$distance_matrix) +``` + +`us_distance_matrix()` is available for all U.S. counties, but is typically too large for routine examples: + +```{r, eval = FALSE, echo=TRUE} +us_dm <- us_distance_matrix() +dim(us_dm$distance_matrix) +``` + +### Limited Distance Lists + +For applications with many hundreds of locations, the package function `create_dist_list()` returns a sparse neighbor representation. The output of this function is a list of vectors containing the nearby neighbor locations for each data location, where "nearby" means within a user-provided threshold. The example below returns a list of all counties within 25 miles for every county in Rhode Island. This list is adequate for detection of clusters whose radius is at most 25 miles; for many outcomes, users do not seek clusters whose geographic extent exceeds a threshold. + +In summary, the matrix functions return all pairwise distances in a large, square matrix, while `create_dist_list()` returns only a sparse list of within-threshold neighbors.
Sparse lists are often faster and lighter in memory for large geographies. + +```{r, echo=TRUE} +county_list <- create_dist_list( + level = "county", + st = "RI", + threshold = 25 +) +length(county_list) +head(county_list[[1]]) +``` + +## Main Wrapper: `find_clusters()` + +The primary user-facing function is `find_clusters()`. It executes the following component steps: + +1. build baseline/test count grids, +2. aggregate nearby-location counts by distance, +3. compute observed and expected values, +4. apply spline-derived significance thresholds, +5. compress candidate clusters to remove less significant ones and avoid overlap, +6. append all location-level counts for retained clusters. + +### Preparing Inputs + +For any distance object (matrix or sparse list), location names must match the `location` column in your case data. + +```{r, echo=TRUE} +locs <- unique(cases$location) +dm <- outer(seq_along(locs), seq_along(locs), function(i, j) abs(i - j) * 5) +dimnames(dm) <- list(locs, locs) +``` + +### Key Parameters in `find_clusters()` + +| Parameter | Purpose | Comment | |------------------------|------------------------|------------------------| | `cases` | Input location-date-count data | Include only rows whose count is positive. | | `distance_matrix` | Named matrix (dense) or named list (sparse) of distances | Used for scanning to include progressively closer locations. | | `detect_date` | End date for the detection window | The function returns only clusters ending on this date. | | `spline_lookup` | Spline threshold table (`NULL`, built-ins `"05"`, `"01"`, `"005"`, `"001"`, or custom frame) | Derived from archived SaTScan cluster output at p-values of 0.05, 0.01, 0.005, and 0.001, to emulate those significance levels. | | `baseline_length` | Number of days in baseline period | Default is 90 days; larger intervals can be used if available in `cases`.
| +| `max_test_window_days` | Maximum test window size (days) | Number of days that a cluster can include, i.e., the maximum cylinder height. | | `guard_band` | Gap between baseline and test windows | Default is 0 days; can be used to separate the test window from the baseline. | | `distance_limit` | Max cluster radius in same units as distances | Maximum allowed distance from the centroid to cluster members. | | `baseline_adjustment` | Expected-value handling (`"add_one"`, `"add_one_global"`, `"add_test"`, `"none"`) | Select `"add_test"` to agree with the SaTScan space-time permutation option; the `"add_one"` options add a baseline count when needed. | | `adj_constant` | Constant used when adjustment requires additive offset | Default is 1; the constant is added for the `"add_one"` options to avoid division by zero. | | `min_clust_cases` / `max_clust_cases` | Pre-compression observed-count filters | Set a minimum or maximum on the number of cluster cases. | | `return_interim` | Return intermediate objects for debugging/inspection | Lets the user inspect intermediate data tables. | + +## Generating Cluster Signals for Detection + +This package also provides a function `st_injects()` to create artificial cluster signals for addition to background data, in order to test and evaluate the detection capability of the `find_clusters()` method. A subsequent vignette will provide details on `st_injects()`.
To create a clear demonstration signal, inject a small synthetic cluster first: + +```{r, echo=TRUE} +set.seed(42) +inj <- st_injects( + cases = cases, + distance_matrix = dm, + target_loc = locs[1], + center_decile = 5, + radius_miles = 10, + nr_cases = 40, + nr_days = 3, + end_date = detect_date +) + +cases_inj <- inj$case_counts_inj +nearby <- get_nearby_locations(locs[1], dm, radius_miles = 10) +head(nearby) +``` + +## Running `find_clusters()` + +```{r, echo=TRUE} +clusters <- find_clusters( + cases = cases_inj, + distance_matrix = dm, + detect_date = detect_date, + spline_lookup = "01", + baseline_length = 90, + max_test_window_days = 7, + guard_band = 0, + distance_limit = 15, + baseline_adjustment = "add_one", + adj_constant = 1, + min_clust_cases = 0, + max_clust_cases = Inf, + post_cluster_min_count = 0, + use_fast = TRUE, + return_interim = FALSE +) +``` + +## Built-In Spline Tables and Example Data + +The package ships with spline lookup tables. For observed cluster counts, the tables provide significance thresholds approximating p-values of 0.001, 0.005, 0.01, and 0.05: + +- `spline_001` +- `spline_005` +- `spline_01` +- `spline_05` + +and demonstration data with tables of county and zip code locations and shape files: + +- `example_count_data` +- `counties` +- `zipcodes` + +These can be used directly in workflows, examples, and tests.
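To make the thresholding step concrete, the sketch below shows schematically how a lookup of this kind can be applied. The `lookup` and `candidates` values are invented for illustration and do not come from the packaged `spline_*` tables:

```r
# Schematic spline-threshold check (made-up values, not the packaged splines):
# alert when a candidate cluster's observed count exceeds the threshold
# interpolated at its expected count.
lookup <- data.frame(expected  = c(1, 5, 10, 20),
                     threshold = c(6, 12, 19, 32))
candidates <- data.frame(observed = c(8, 12), expected = c(1, 10))

# Linear interpolation of the threshold curve at each expected value
thr <- approx(lookup$expected, lookup$threshold, xout = candidates$expected)$y
candidates$alert <- candidates$observed > thr
candidates
```

Here the first candidate alerts (8 observed against a threshold of 6) while the second does not (12 observed against a threshold of 19).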
## Example cluster + +```{r plotting-clusters, echo = TRUE} +library(gsClusterDetect) +library(dplyr) +library(ggplot2) +library(tigris) +library(sf) + +options(tigris_use_cache = TRUE) + +# Use example count data to find clusters +d <- data.table::as.data.table(example_count_data) +dd <- d[, max(date)] +dm <- create_dist_list("county", st = "OH", threshold = 50) +cl <- find_clusters(d, dm, dd) +# Join locations/clusters to the tigris counties shape file +oh <- counties("OH", cb = TRUE, class = "sf") |> + left_join(cl$cluster_location_counts, by = c("GEOID" = "location")) + +# Find the number of significant clusters +n_groups <- nrow(cl$cluster_alert_table) +# Get that number of shades of blue +blue_vals <- setNames( + colorRampPalette(c("lightblue", "darkblue"))(n_groups), + sort(unique(na.omit(oh$target))) +) + +# Make a simple map of the shape file, filling by cluster (target) +# and coloring the fill values with the blue shades +ggplot(oh) + + geom_sf(aes(fill = factor(target)), color = "black", linewidth = 0.2) + + scale_fill_manual(values = blue_vals, na.value = NA) + + theme_void() + + theme(legend.position = "none") +``` + +## Summary and Plotting Utilities + +Create a statistical summary of the input data: + +```{r, echo=TRUE} +summary_tbl <- generate_summary_table( + data = cases_inj, + end_date = detect_date, + baseline_length = 90, + test_length = 7 +) +summary_tbl +``` + +Create a visual heatmap summary of the input data to assess suitability for cluster detection.
This plot illustrates the spread of data counts over locations: + +```{r, echo=TRUE} +heatmap_data <- generate_heatmap_data( + data = cases_inj, + end_date = detect_date, + baseline_length = 90, + test_length = 7 +) + +p_heat <- generate_heatmap(heatmap_data, plot_type = "ggplot") +class(p_heat) +plot(p_heat) +``` + +Create a time series of aggregated data counts over all locations: + +```{r, echo = TRUE} +ts_data <- generate_time_series_data( + data = cases_inj, + end_date = detect_date, + baseline_length = 90, + test_length = 7 +) + +p_ts <- generate_time_series_plot(ts_data, plot_type = "ggplot") +class(p_ts) +plot(p_ts) +``` + +Inspect intermediate objects: + +```{r, echo=TRUE} +interim <- find_clusters( + cases = cases_inj, + distance_matrix = dm, + detect_date = detect_date, + return_interim = TRUE +) + +names(interim) +``` + +## Intermediate Pipeline Functions + +These are used by the `find_clusters()` function and may be useful for custom diagnostics and method development. + +```{r, echo=TRUE} +cg <- generate_case_grids( + cases = cases_inj, + detect_date = detect_date, + baseline_length = 90, + max_test_window_days = 7 +) + +nci <- gen_nearby_case_info( + cg = cg, + distance_matrix = dm, + distance_limit = 15 +) + +oe <- generate_observed_expected( + nearby_counts = nci, + cases_grids = cg, + adjust = TRUE, + adj_constant = 1 +) + +cat_tbl <- add_spline_threshold(oe_grid = oe, spline_lookup = "01") +```
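When neither the built-in matrix functions nor `create_dist_list()` fit a use case, a named distance matrix can also be assembled directly from centroid coordinates. The sketch below uses a standard haversine formula; `haversine_miles`, the coordinates, and the location names are all invented for illustration, and only the requirement that the `dimnames` match the `location` column comes from this vignette:

```r
# Sketch: build a named distance matrix (miles) from centroid coordinates.
# Coordinates and location names here are invented for illustration.
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  r <- 3959                 # Earth radius in miles
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

cent <- data.frame(location = c("A", "B", "C"),
                   lat = c(39.0, 39.2, 39.5),
                   lon = c(-77.0, -77.3, -76.8))
dm <- outer(seq_len(nrow(cent)), seq_len(nrow(cent)), function(i, j)
  haversine_miles(cent$lat[i], cent$lon[i], cent$lat[j], cent$lon[j]))
# Row/column names must match the `location` column of the case data
dimnames(dm) <- list(cent$location, cent$location)
round(dm, 1)
```

The resulting matrix is symmetric with a zero diagonal and can be passed wherever a dense, named distance matrix is accepted.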