PySofra is the Python equivalent of R's gtsummary and tableone.
It produces the table types standard in clinical and epidemiological
manuscripts (baseline-characteristics tables, i.e. Table 1; regression
summaries; Kaplan–Meier survival tables) from a single immutable object
that renders to seven output formats: HTML, Markdown, LaTeX, DOCX, PPTX,
XLSX, and PNG.
Python researchers have long reached for gtsummary, tableone, and
flextable in R for manuscript tables. PySofra brings that workflow
natively to Python, with survey-weighted designs and multiple-imputation
pooling built in.
Baseline-characteristics Table 1, stratified by treatment arm. Clinical theme. p-values, standardised mean differences, and an Overall column added as single-line modifiers.
Most reporting packages test that code runs. PySofra tests that numbers are correct.
A 54-step audit notebook downloads the 2017–18 US National Health and Nutrition Examination Survey directly from the CDC, fits models, and asserts 52 numerical contracts against independent R reference implementations:
| Quantity | Reference | Tolerance | Observed |
|---|---|---|---|
| Weighted mean (5 variables) | R svymean | ≤ 10⁻⁹ rel | < 10⁻¹⁴ |
| Weighted mean (Table 1 cell) | gtsummary display | ≤ 10⁻⁹ rel | 3.3 × 10⁻¹⁵ |
| Weighted SD (Table 1 cell) | gtsummary display | ≤ 10⁻⁹ rel | 3.8 × 10⁻¹⁵ |
| Weighted proportion (4 vars) | gtsummary display | ≤ 10⁻⁹ rel | 4.6 × 10⁻¹⁵ |
| Survey regression SE (6 coefs) | R svyglm | ≤ 1% rel | < 0.8% |
| MI pooled β̂ | Rubin (1987) | ≤ 10⁻¹⁰ abs | < 10⁻¹⁴ |
| KM survival probability | lifelines | ≤ 10⁻¹² abs | < 10⁻¹⁵ |
| Wilson CI endpoint | Newcombe (1998) | ≤ 10⁻⁹ abs | < 10⁻¹⁵ |
Weighted means and SDs agree with R at floating-point machine precision — a few multiples of ε ≈ 2.2 × 10⁻¹⁶. That's not approximation; it's the same formula.
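A numerical contract of this kind boils down to a relative-error assertion. The following is a minimal sketch of the idea, not the audit notebook's code: the data are made up, the helper names `relative_error` and `weighted_mean` are hypothetical, and the "reference" value is a second Python computation standing in for a number exported from R.

```python
import math
import sys


def relative_error(observed: float, reference: float) -> float:
    """Return |observed - reference| / |reference| for a non-zero reference."""
    if reference == 0.0:
        raise ValueError("relative error is undefined for a zero reference")
    return abs(observed - reference) / abs(reference)


def weighted_mean(values, weights):
    """Weighted mean, using math.fsum for accurate summation."""
    total_weight = math.fsum(weights)
    return math.fsum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data; "reference" stands in for a value produced by R.
values = [61.2, 48.9, 35.4, 72.0]
weights = [1200.5, 980.25, 1510.0, 640.75]
observed = weighted_mean(values, weights)
reference = sum(v * w for v, w in zip(values, weights)) / sum(weights)

rel = relative_error(observed, reference)
assert rel <= 1e-9  # tolerance taken from the table above
print(f"relative error = {rel:.2e}; machine epsilon = {sys.float_info.epsilon:.2e}")
```

Two implementations of the same formula can still differ by a few units of rounding error because summation order differs, which is why the tolerance is a small positive number rather than exact equality.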
Nominal 95% CIs from survey-weighted logistic regression attain 94.2% and 93.8% empirical coverage in a 1,000-replicate Monte Carlo study with known truth.
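The shape of such a coverage study can be sketched with a much simpler estimator. The example below deliberately swaps in a normal-approximation interval for a sample mean instead of survey-weighted logistic regression; the function name `empirical_coverage`, the parameter values, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def empirical_coverage(true_mean=10.0, sd=2.0, n=100, replicates=1000,
                       z=1.959963984540054, seed=2024):
    """Fraction of replicates whose nominal 95% CI contains the known truth."""
    rng = random.Random(seed)  # seeded, so the study is reproducible
    hits = 0
    for _ in range(replicates):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        centre = math.fsum(sample) / n
        # Sample variance with the usual n - 1 denominator.
        var = math.fsum((x - centre) ** 2 for x in sample) / (n - 1)
        half_width = z * math.sqrt(var / n)
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / replicates


rate = empirical_coverage()
print(f"empirical coverage over 1,000 replicates: {rate:.1%}")
```

With 1,000 replicates the Monte Carlo standard error of a coverage estimate near 95% is about 0.7 percentage points, so observed values a point or so below nominal are within the noise of the simulation itself.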
Every contract runs in CI on every push. The full audit notebook is pre-executed and readable without installing anything; see AUDITOR.md for the single-command reproduction recipe.
pip install "pysofra[all]"
import pysofra as ps
# Table 1 — baseline characteristics by treatment arm
tbl = (
ps.tbl_one(df, by="arm",
labels={"age": "Age (years)", "bmi": "BMI (kg/m²)"},
nonnormal=["bmi"])
.add_p()
.add_smd()
.add_overall()
.theme("clinical")
)
tbl # renders in Jupyter / VS Code / Colab
tbl.to_docx("table1.docx") # publication-quality Word
tbl.to_latex() # LaTeX fragment
tbl.to_html() # standalone HTML
# Regression table with inline forest plot
import statsmodels.api as sm
fit = sm.Logit(df["event"], sm.add_constant(df[["age", "bmi"]])).fit(disp=False)
(
ps.tbl_regression(fit, exponentiate=True)
.with_forest_plot()
.bold_p()
.theme("jama")
.to_docx("table2.docx")
)
# Survey-weighted Table 1 (NHANES-style design)
design = ps.SurveyDesign(weights="WTMEC2YR", strata="SDMVSTRA", cluster="SDMVPSU")
(
ps.tbl_one(df, by="diabetes", design=design)
.add_p()
.add_overall()
.theme("clinical")
.to_docx("table_nhanes.docx")
)
# Multiple-imputation pooling (Rubin's rules)
pooled = ps.pool(fits) # list of per-imputation model fits
(
ps.tbl_regression(pooled, exponentiate=True, intercept=False)
.set_caption(f"Pooled ORs (m = {len(fits)}, Rubin's rules)")
.to_docx("table_mi.docx")
)
| Feature | R ecosystem | PySofra | Python alternatives |
|---|---|---|---|
| Table 1 (baseline characteristics) | tableone, gtsummary | Yes | tableone (partial) |
| Regression table | gtsummary | Yes | None |
| Survival (KM) summary | gtsummary | Yes | None |
| Survey-weighted Table 1 | gtsummary + survey | Yes | None |
| Multiple-imputation pooling | manual mice coordination | Yes | None |
| Word + LaTeX + five more formats | separate packages | Yes | None |
| Byte-deterministic output | None | Yes | None |
| Safety diagnostics embedded in table | None | Yes | None |
| Machine-precision numerical validation | None | Yes | None |
One immutable object, seven output formats. Build a SofraTable once; render to
HTML, Markdown, LaTeX, DOCX, PPTX, XLSX, or PNG. Output is byte-identical across
processes — a hard requirement for reproducible manuscript artefacts tracked in git.
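One way to check byte-determinism is to hash the rendered bytes and compare digests between runs. The sketch below uses a stand-in renderer, `render_markdown`, which is hypothetical and not part of PySofra; with the real library the bytes would come from an export method instead.

```python
import hashlib


def render_markdown(header, rows):
    """Stand-in renderer: output depends only on the inputs, never on
    timestamps, random IDs, or unordered containers."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return ("\n".join(lines) + "\n").encode("utf-8")


def digest(data: bytes) -> str:
    """SHA-256 hex digest of the rendered bytes."""
    return hashlib.sha256(data).hexdigest()


header = ["Characteristic", "Overall"]
rows = [["Age (years)", "54.1 (12.3)"], ["BMI (kg/m²)", "27.4 (24.1, 31.0)"]]

first = digest(render_markdown(header, rows))
second = digest(render_markdown(header, rows))
assert first == second  # identical input, identical bytes
print(first)
```

Storing the digest next to the artefact gives a cheap regression check: if a re-render changes the hash, either the data or the renderer changed.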
Typed-value cells. Every cell stores both a Python float and a rendered
string. bold_p(threshold=0.05) compares the float directly, so it works correctly
whether the display reads "0.032", "0.03", or "<0.001". No fragile parsing of
display strings such as "<0.001" is needed.
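The typed-cell idea can be illustrated with a small immutable dataclass. This is a sketch under assumptions: the names `Cell`, `format_p`, and `bold_below` are hypothetical, and whether PySofra uses a strict `<` comparison at the threshold is not stated in this README.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[float]  # numeric value used for all comparisons
    text: str               # rendered string shown in the table
    bold: bool = False


def format_p(p: float) -> Cell:
    """Keep the float alongside its display form."""
    text = "<0.001" if p < 0.001 else f"{p:.3f}"
    return Cell(value=p, text=text)


def bold_below(cell: Cell, threshold: float = 0.05) -> Cell:
    """Return a new bolded cell when the stored float is below the threshold."""
    if cell.value is not None and cell.value < threshold:
        return replace(cell, bold=True)
    return cell


cells = [format_p(p) for p in (0.032, 0.0004, 0.21)]
styled = [bold_below(c) for c in cells]
for c in styled:
    print(c.text, "(bold)" if c.bold else "")
```

Because the comparison never touches `text`, the "<0.001" cell is bolded without anyone having to parse the leading "<".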
Auto-dispatched tests. Welch's t, ANOVA, Wilcoxon, Kruskal–Wallis, Fisher, χ², Taylor-linearised design-adjusted t, first-order Rao–Scott χ² — picked per row by variable kind, overridable per variable.
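Per-row dispatch can be pictured as a small decision function. The rules below are illustrative assumptions rather than PySofra's actual logic: the function `choose_test`, its parameters, and the expected-count cutoff of 5 for Fisher's exact test are all hypothetical.

```python
def choose_test(kind, n_groups, *, nonnormal=False, min_expected=None,
                weighted=False, override=None):
    """Pick a test name for one table row (hypothetical rules)."""
    if override is not None:
        return override  # a per-variable override always wins
    if kind == "continuous":
        if weighted:
            return "design-adjusted t"
        if nonnormal:
            return "Wilcoxon" if n_groups == 2 else "Kruskal-Wallis"
        return "Welch t" if n_groups == 2 else "ANOVA"
    if kind == "categorical":
        if weighted:
            return "Rao-Scott chi-squared"
        # Conventional rule of thumb: sparse expected counts favour Fisher.
        if min_expected is not None and min_expected < 5:
            return "Fisher"
        return "chi-squared"
    raise ValueError(f"unknown variable kind: {kind!r}")


print(choose_test("continuous", 2))
print(choose_test("continuous", 3, nonnormal=True))
print(choose_test("categorical", 2, min_expected=3.2))
```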
Publication-safety diagnostics. with_safety_warnings() appends footnotes
for separation in logistic regression, PH-assumption violations, and sparse
contingency cells — directly into the rendered DOCX, HTML, and LaTeX, not just
a console warning that disappears in batch output.
Adjusted odds ratios with inline forest plot: tbl_regression(fit).with_forest_plot()
Kaplan–Meier table with embedded survival curve: tbl_survival(...).with_km_plot()
Builders
| Builder | What it produces |
|---|---|
| ps.tbl_one(df, by=) | Baseline-characteristics table stratified by group |
| ps.tbl_summary(df) | Descriptive summary (no stratification) |
| ps.tbl_cross(df, row=, col=) | Two-way cross-tabulation |
| ps.tbl_regression(fit) | Regression results (statsmodels, lifelines, sklearn) |
| ps.tbl_uvregression(df, outcome=, predictors=) | Univariable regression panel |
| ps.tbl_survival(df, time=, event=, by=) | Kaplan–Meier summary + optional curves |
Modifiers (each returns a new SofraTable)
add_p() · add_q() · add_smd() · add_overall() · add_difference() · add_ci() ·
bold_p() · bold_if() · highlight_if() · theme() · set_caption() · with_footnotes() ·
with_forest_plot() · with_km_plot() · with_safety_warnings()
Composition
tbl_merge() — horizontal merge with spanning headers
tbl_stack() — vertical stack with group-header rows
Survey & MI
SurveyDesign(weights=, strata=, cluster=, fpc=) — pass to any builder
pool(fits) — Rubin's rules MI pooling; result accepted by tbl_regression
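For a single coefficient, Rubin's rules combine the per-imputation estimates and their variances as follows. This is a textbook sketch in plain Python, not the body of `pool`; the function name `pool_rubin` and the input numbers are made up for illustration.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputations.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(variances)           # within-imputation variance
    b = variance(estimates)           # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b   # total variance
    return q_bar, math.sqrt(total)


# Hypothetical log-odds-ratio estimates and squared standard errors from m = 3 fits.
beta, se = pool_rubin([0.50, 0.54, 0.46], [0.010, 0.012, 0.011])
print(f"pooled beta = {beta:.3f}, pooled SE = {se:.4f}")
```

The `(1 + 1/m)` factor inflates the between-imputation variance to account for using a finite number of imputations.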
Export
.to_html() · .to_markdown() · .to_latex() · .to_docx() · .to_pptx() ·
.to_xlsx() · .to_image() — all byte-deterministic across processes
Statistical methods
Welch t · Wilcoxon · ANOVA · Kruskal–Wallis · Fisher · χ² · Rao–Scott · Taylor-linearised design t · Wilson CI · Newcombe CI · Cohen d · Hedges g · Cramér V · η² · ω² · Kaplan–Meier · log-rank · Cox PH · Rubin's rules MI · BH · BY · Bonferroni · Holm · Hommel · Šidák
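As one concrete instance from this list, the Wilson score interval for a binomial proportion can be written in a few lines. This is the standard formula without continuity correction, given as a sketch; the function name `wilson_ci` is hypothetical and the code is not taken from PySofra.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a proportion (default z gives a 95% interval)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_ci(5, 10)
print(f"5/10: [{low:.4f}, {high:.4f}]")
```

Unlike the Wald interval, the Wilson interval stays inside [0, 1] and has non-zero width even when the observed proportion is 0 or 1.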
pip install pysofra # core (numpy, pandas, scipy, statsmodels, python-docx)
pip install "pysofra[survival]" # + lifelines, matplotlib
pip install "pysofra[plot]" # + forest plots, table-as-image
pip install "pysofra[pptx]" # + PowerPoint export
pip install "pysofra[xlsx]" # + Excel export
pip install "pysofra[polars]" # + polars DataFrame input
pip install "pysofra[sklearn]" # + scikit-learn model support
pip install "pysofra[all]" # everything
Requires Python ≥ 3.11.
- Quickstart
- Guides — tbl_one, tbl_regression, survey weights, MI pooling, themes, exports
- API reference
- NHANES validation notebook — 52 numerical contracts, live results
Version 0.1.0 — first stable release.
The public API is pinned by an explicit API-stability test — any unintended rename, removal, or signature change surfaces as a failed test before merge. The full deprecation policy is in Concepts → API stability.
Quality bar at this release:
- 1,036 tests passing on Python 3.11 and 3.12 (Ubuntu + macOS), 100% line coverage
- 52 numerical contracts against R survey, gtsummary, lifelines, scipy, statsmodels, and textbook formulas, run in CI on every push
- Property-based tests (Hypothesis) enforce universal invariants across 720 randomised examples per run: p-value ∈ [0, 1], ordered CI bounds, percent ∈ [0, 100]
- Byte-deterministic renderer output — identical input → identical bytes, across processes, required for reproducible git-tracked manuscript artefacts
- mypy strict · ruff clean · public signatures fully type-annotated
If you use PySofra in academic work, cite the software via the
CITATION.cff metadata (GitHub shows a Cite this repository
button in the right-hand sidebar). A full-length methods paper is in preparation.
Bug reports, feature requests, and pull requests are welcome.
See CONTRIBUTING.md for the workflow and quality gates, and
the Code of Conduct.
GPL-3.0-or-later. See LICENSE.

