Simulation helpers for covariate-driven AnnData examples.
The current simulation API lives in `_simulate_data/covar_dependent_feature.py` and is re-exported from `adata_science_tools._simulate_data`.

- `sim_observations_covars`
- `sim_covar_dependent_features`
- `sim_covar_dependent_dataset`

`sim_observations_covars(...)` creates an `obs_df` with one column per requested covariate.

    from adata_science_tools._simulate_data import sim_observations_covars

    obs_df = sim_observations_covars(
        obs_key_list=["Age", "case_control"],
        obs_covar_dist_params={
            "Age": {"dist": "normal", "mean": 50.0, "stdev": 10.0},
            "case_control": {"dist": "binomial", "prob": 0.5},
        },
        n_obs=100,
        random_seed=7,
    )

Important behavior:

- `obs_covar_dist_params` is keyed by covariate name, not by distribution name.
- Supported distributions are `normal` and `binomial`; the typo alias `bionomial` is also accepted.
- Normal draws are stored as float values.
- Binomial draws are stored as `0`/`1` integer values.
- Observation names use `obs_names_prefix` with 1-based indexing, for example `obs_1`, `obs_2`, `obs_3`.

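
The behaviors listed above can be illustrated with a small NumPy-only sketch. It is not the package's implementation: the helper name `sketch_observation_covars`, its `prefix` argument, and the use of `numpy.random.default_rng` are assumptions made for illustration, and it returns plain arrays rather than a pandas `obs_df`.

```python
import numpy as np


def sketch_observation_covars(obs_key_list, dist_params, n_obs, seed=None, prefix="obs_"):
    """Hypothetical stand-in that mimics the documented behavior."""
    rng = np.random.default_rng(seed)
    columns = {}
    for key in obs_key_list:
        params = dist_params[key]  # looked up by covariate name, not by distribution
        dist = params["dist"]
        if dist == "normal":
            # Normal draws stay floating point.
            columns[key] = rng.normal(params["mean"], params["stdev"], size=n_obs)
        elif dist in ("binomial", "bionomial"):  # the typo alias is tolerated
            # A single-trial binomial yields 0/1 integers.
            columns[key] = rng.binomial(1, params["prob"], size=n_obs).astype(int)
        else:
            raise ValueError(f"unsupported distribution: {dist!r}")
    # Observation names are 1-based: obs_1, obs_2, ...
    obs_names = [f"{prefix}{i}" for i in range(1, n_obs + 1)]
    return obs_names, columns


obs_names, columns = sketch_observation_covars(
    ["Age", "case_control"],
    {
        "Age": {"dist": "normal", "mean": 50.0, "stdev": 10.0},
        "case_control": {"dist": "binomial", "prob": 0.5},
    },
    n_obs=5,
    seed=7,
)
print(obs_names)
print({name: values.dtype.kind for name, values in columns.items()})
```

Keying the parameters by covariate name lets two covariates share a distribution while keeping separate settings, which a distribution-keyed mapping could not express.
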
`sim_covar_dependent_features(...)` treats the columns of `obs_df` as predictors, coerces them to numeric values, and generates a linear feature matrix with optional additive residual noise:

    X = obs_matrix @ beta_matrix.T + yint + residual

    from adata_science_tools._simulate_data import sim_covar_dependent_features

    X, var_df, obs_df, adata = sim_covar_dependent_features(
        obs_df=obs_df,
        var_names=["simulated_feature"],
        betas=[0.05, 5.0],
        yints=10.0,
        residual_stdev=1.0,
        random_seed=7,
        also_return_adata=True,
        save_adata_dataset=False,
    )

Important behavior:

- A 1D `betas` sequence must match the number of covariates and is broadcast across all simulated features.
- A 2D `betas` array must have shape `(n_vars, n_covars)`.
- Scalar `yints` values are broadcast across all simulated features.
- `residual_mean` and `residual_stdev` accept scalar values or 1D sequences of length `n_vars`.
- `var_df` uses `var_names` as its index and stores `yint`, one `beta_<covariate>` column per predictor, and the residual-noise settings used for each feature.
- With the default `residual_stdev=0.0`, the function remains deterministic for a fixed `obs_df`, `betas`, and `yints`.
- When `adata` is returned, `adata.X` stores the observed noisy feature values, `adata.layers["linear_mean"]` stores the noiseless linear predictor, and `adata.layers["residual"]` stores the realized residual term.
- Non-numeric predictor columns are rejected at this layer.

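
The linear model and the broadcasting rules above can be sketched with plain NumPy. This is an illustration under assumed shapes and values, not the function's actual code; variable names such as `betas_1d` and `linear_mean` are chosen here for readability.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_covars, n_vars = 100, 2, 3

# Predictor matrix: one row per observation, one column per covariate.
obs_matrix = np.column_stack([
    rng.normal(50.0, 10.0, size=n_obs),  # Age-like covariate
    rng.binomial(1, 0.5, size=n_obs),    # case_control-like covariate
]).astype(float)

# A 1D betas sequence (length n_covars) is repeated for every feature;
# a 2D array would already have shape (n_vars, n_covars).
betas_1d = np.array([0.05, 5.0])
beta_matrix = np.broadcast_to(betas_1d, (n_vars, n_covars))

# Scalar intercept and noise scale, broadcast to one value per feature.
yint = np.full(n_vars, 10.0)
residual_stdev = np.full(n_vars, 1.0)

# Noiseless linear predictor, realized noise, and their sum.
linear_mean = obs_matrix @ beta_matrix.T + yint  # shape (n_obs, n_vars)
residual = rng.normal(0.0, residual_stdev, size=(n_obs, n_vars))
X = linear_mean + residual

print(X.shape, beta_matrix.shape)
```

Keeping `linear_mean` and `residual` as separate arrays mirrors the two layers described above: subtracting the residual from the observed matrix gives back the noiseless predictor.
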
`sim_covar_dependent_dataset(...)` is the wrapper that first simulates covariates and then simulates features from those covariates.

    from adata_science_tools._simulate_data import sim_covar_dependent_dataset

    X, var_df, obs_df, adata = sim_covar_dependent_dataset(
        obs_key_list=["Age", "case_control"],
        obs_covar_dist_params={
            "Age": {"dist": "normal", "mean": 50.0, "stdev": 10.0},
            "case_control": {"dist": "binomial", "prob": 0.5},
        },
        n_obs=100,
        random_seed=7,
        var_names=["simulated_feature"],
        betas=[0.05, 5.0],
        yints=10.0,
        residual_stdev=1.0,
        save_adata_dataset=False,
    )

Important behavior:

- The wrapper returns `(X, var_df, obs_df, adata)`.
- `AnnData` is created whenever `also_return_adata=True` or `save_adata_dataset=True`.
- Residual-noise settings are passed through to `sim_covar_dependent_features(...)`.
- Dataset export reuses the package save helper and writes `.h5ad`, `.obs.csv`, `.var.csv`, and `.X.csv` sidecars.
- When `adata.layers` are present, the same export path also writes one CSV per layer, so this simulator now emits `linear_mean` and `residual` sidecars alongside the main matrix export.

The repository now includes a config-driven simulation and plotting example in `example_simulated_data/`.

- `example_simulated_data/scripts/simulate_1_var_covar_age.py` generates one feature, `simulated_feature`, from two predictors, `Age` and `case_control`, plus config-driven residual `y` variance.
- The numeric `case_control` backend used for simulation is relabeled to public string values `'case'` and `'control'` before the dataset is saved.
- `example_simulated_data/scripts/plot_dotplot_simulate_1_var_covar_age.py` loads the saved `.h5ad` and plots `simulated_feature` versus `Age` with `case_control` as both `hue` and `subset_key`.
- Both scripts are driven from `example_simulated_data/config/config.yaml`.

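
The relabeling step can be pictured as a simple lookup applied after simulation. The sketch below assumes that `1` marks `'case'` and `0` marks `'control'`; the array names are hypothetical, and the example script may implement this differently.

```python
import numpy as np

# Numeric codes are what the linear model consumes during simulation.
codes = np.array([1, 0, 0, 1, 1])

# String labels replace the codes only once the feature values are computed.
labels = np.where(codes == 1, "case", "control")

print(labels.tolist())  # ['case', 'control', 'control', 'case', 'case']
```

Relabeling after simulation keeps the predictor numeric while it is multiplied by its beta, and leaves readable group names for plotting.
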
The example config currently exposes these main simulation controls:

- `age_mean` and `age_stdev`: control the center and spread of the `Age` covariate.
- `beta` or `beta_age`: controls the age slope for `simulated_feature`.
- `case_control_prob`: controls the fraction of observations assigned to the `case` group before relabeling from `1`/`0` to `'case'`/`'control'`.
- `beta_case_control`: controls the expected vertical shift between `case` and `control` in the simulated feature.
- `residual_mean` and `residual_stdev`: control the additive residual noise around the linear mean model.
- `random_seed`: keeps the full simulated dataset deterministic across reruns.

In the current default baseline config:

- `beta_case_control: 2.0` sets the mean case-control separation.
- `residual_stdev: 1.0` prevents subgroup points from falling exactly on their fitted lines.
- `case_control_prob: 0.5` targets an approximately balanced case/control split.

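
One way to sanity-check what these settings mean is to simulate from the same linear model and recover the coefficients by least squares. The sketch below uses plain NumPy. Only `beta_case_control`, `residual_stdev`, and `case_control_prob` are taken from the baseline above; the age slope, intercept, age distribution, and sample size are assumed values borrowed from the earlier examples.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs = 5000  # assumed sample size, large enough for stable estimates

# Baseline values from the config description.
beta_case_control, residual_stdev, case_control_prob = 2.0, 1.0, 0.5
# Assumed values, mirroring the earlier usage examples.
age_mean, age_stdev, beta_age, yint = 50.0, 10.0, 0.05, 10.0

age = rng.normal(age_mean, age_stdev, size=n_obs)
case = rng.binomial(1, case_control_prob, size=n_obs)
noise = rng.normal(0.0, residual_stdev, size=n_obs)
y = yint + beta_age * age + beta_case_control * case + noise

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(n_obs), age, case])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
est_yint, est_beta_age, est_beta_case_control = coef

print(f"case fraction: {case.mean():.3f}")
print(f"estimated beta_age: {est_beta_age:.3f}")
print(f"estimated beta_case_control: {est_beta_case_control:.3f}")
```

With nonzero residual noise the estimates land near, not exactly on, the configured values, which is the same effect that keeps plotted points off their fitted lines.
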
Run the example from the repo root with:

    python example_simulated_data/scripts/simulate_1_var_covar_age.py
    python example_simulated_data/scripts/plot_dotplot_simulate_1_var_covar_age.py

The default baseline outputs are: