(Markdown) code generation #451

Open
Seth Fitzsimmons (mojodna) wants to merge 35 commits into dev from codegen

Conversation

Collaborator

Seth Fitzsimmons (mojodna) commented Feb 27, 2026

Summary

Add overture-schema-codegen, a code generator that produces documentation from
Pydantic schema models.

Pydantic's model_json_schema() flattens the schema's domain vocabulary into JSON
Schema primitives. NewType names, constraint provenance, and custom constraint classes
disappear. Navigating Python's type annotation machinery -- NewType chains, nested
Annotated wrappers, union filtering, generic resolution -- is complex. The codegen
does it once. analyze_type() unwraps annotations into TypeInfo, a flat
target-independent representation that renderers consume without re-entering the type
system.

Architecture

Four layers with strict downward imports. The package layout mirrors the
architecture -- each layer is a sub-package:

markdown/    Rendering       ← Output formatting, all presentation decisions
layout/      Output Layout   ← What to generate, where it goes
extraction/  Extraction      ← TypeInfo, FieldSpec, ModelSpec, UnionSpec
             Discovery       ← discover_models() from overture-schema-core

Discovery lives in overture-schema-core, not in the codegen package.
cli.py sits at the package root and imports from all three sub-packages.

analyze_type() is the central function. A single iterative loop peels NewType,
Annotated, Union, and container wrappers in fixed order, accumulating constraints tagged
with the NewType that contributed them. The result is a TypeInfo dataclass that
downstream modules consume without re-entering the type system.
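The loop shape can be sketched roughly as follows. This is a simplified, hypothetical reconstruction: `TypeInfo`'s real fields and the exact unwrap order live in `extraction/type_analyzer.py`, and the sketch omits dicts, tuples, and terminal classification.

```python
import typing
from dataclasses import dataclass, field
from typing import Annotated, Any, NewType, get_args, get_origin

@dataclass
class TypeInfo:
    """Flat, target-independent result; a hypothetical subset of the real fields."""
    base_type: Any
    newtype_names: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # (source_name, metadata) pairs
    list_depth: int = 0
    is_optional: bool = False

def analyze_type_sketch(tp: Any) -> TypeInfo:
    info = TypeInfo(base_type=tp)
    source = None  # name of the NewType that contributed subsequent constraints
    while True:
        if hasattr(tp, "__supertype__"):           # NewType: record name, keep peeling
            info.newtype_names.append(tp.__name__)
            source = tp.__name__
            tp = tp.__supertype__
        elif get_origin(tp) is Annotated:          # Annotated: collect tagged constraints
            inner, *meta = get_args(tp)
            info.constraints += [(source, m) for m in meta]
            tp = inner
        elif get_origin(tp) is typing.Union:       # Optional/Union: filter out None
            args = [a for a in get_args(tp) if a is not type(None)]
            if len(args) < len(get_args(tp)):
                info.is_optional = True
            if len(args) != 1:
                break                              # genuine multi-member union: stop here
            tp = args[0]
        elif get_origin(tp) is list:               # list[...]: count nesting depth
            info.list_depth += 1
            tp = get_args(tp)[0]
        else:
            break                                  # terminal type reached
    info.base_type = tp
    return info
```

Note how constraints seen before any NewType is peeled carry a `None` source, matching the field-level vs. NewType-level rendering split described under "Design decisions" below.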

Both concrete BaseModel subclasses and discriminated union type aliases (like Segment = Annotated[Union[RoadSegment, ...], ...]) satisfy the FeatureSpec protocol and flow
through the same pipeline. Union extraction finds the common base class, partitions
fields into shared and variant-specific, and extracts the discriminator mapping.

markdown/pipeline.py orchestrates the full pipeline without I/O: tree expansion,
supplementary type collection, path assignment, reverse references, and rendering.
Returns list[RenderedPage]. The CLI writes files to disk with Docusaurus frontmatter.

Design doc: packages/overture-schema-codegen/docs/design.md

Changes outside the codegen package

Preparatory fixes and refactors in core/system/CLI packages:

  • Rename ModelKey.class_name to entry_point (carries module:Class path, not just the
    class name)
  • Attach docstrings to NewTypes at runtime (so the codegen can extract them)
  • Add resolve_discriminator_field_name() to system feature module
  • Fix relative imports and f-string prefixes in core
  • Use dict instead of Mapping in system test util type hints

Example data added to theme pyproject.toml files (addresses, base, buildings,
divisions, places) under [examples.ModelName] sections.

What's in the package

Source (33 files, ~3,800 lines):

| Module | Purpose |
| --- | --- |
| extraction/type_analyzer.py | Iterative type unwrapping into TypeInfo |
| extraction/specs.py | Data structures shared between extraction and rendering |
| extraction/type_registry.py | Type name → per-target display string mapping |
| extraction/model_extraction.py | Pydantic model → ModelSpec, tree expansion |
| extraction/union_extraction.py | Union alias → UnionSpec, discriminator mapping |
| extraction/enum_extraction.py | Enum → EnumSpec |
| extraction/newtype_extraction.py | NewType → NewTypeSpec |
| extraction/primitive_extraction.py | Numeric primitives and geometry types |
| extraction/field_constraints.py | Constraint objects → display text |
| extraction/model_constraints.py | Model-level constraints → prose |
| layout/module_layout.py | Python module paths → output directories |
| layout/type_collection.py | Supplementary type discovery from field trees |
| markdown/path_assignment.py | Type names → output file paths |
| markdown/link_computation.py | Relative links between output pages |
| markdown/reverse_references.py | "Used By" reference computation |
| markdown/type_format.py | TypeInfo → markdown type strings with links |
| markdown/renderer.py | Jinja2 template driver for all page types |
| extraction/examples.py | TOML example loading, validation, flattening |
| markdown/pipeline.py | Pipeline orchestration (no I/O) |
| cli.py | Click CLI: generate and list commands |
| extraction/case_conversion.py | PascalCase → snake_case |
| extraction/docstring.py | Custom vs. auto-generated docstring detection |

Tests (34 files, ~6,600 lines): unit tests per module, golden file tests for
rendered markdown, integration tests against real schema models.

Design decisions worth reviewing

analyze_type is iterative, not recursive. The while True loop handles arbitrary
nesting depth (NewType wrapping Annotated wrapping NewType wrapping Annotated...)
without stack growth. Dict key/value types are the one exception where it recurses.

Cache insertion before recursion in expand_model_tree. The sub-model's ModelSpec
enters the cache before its fields are expanded. A back-edge encounter finds the cached
entry and marks starts_cycle=True rather than infinite-looping.
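The cycle-avoidance pattern is memoization with pre-insertion. A toy sketch, where `Node` and `Spec` are stand-ins for a Pydantic model and the real ModelSpec:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    """Stand-in for a Pydantic model with sub-model fields."""
    name: str
    children: list = field(default_factory=list)

@dataclass
class Spec:
    """Stand-in for ModelSpec."""
    name: str
    starts_cycle: bool = False
    children: list = field(default_factory=list)

def expand(node: Node, cache: dict | None = None) -> Spec:
    cache = {} if cache is None else cache
    if node.name in cache:
        hit = cache[node.name]
        hit.starts_cycle = True   # back edge: reuse the cached spec, stop descending
        return hit
    spec = Spec(node.name)
    cache[node.name] = spec       # insert BEFORE expanding children
    spec.children = [expand(child, cache) for child in node.children]
    return spec
```

If the cache insert happened after the recursive calls, a self-referential model would never hit the cache and the expansion would loop forever.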

FeatureSpec is a Protocol, not a base class. ModelSpec and UnionSpec have
different field structures (flat list vs. annotated-field list with variant provenance).
A protocol lets them share a pipeline interface without forcing inheritance.
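A minimal illustration of the structural-typing choice; the class and method names here are simplified stand-ins, and the real protocol has more members:

```python
from typing import List, Protocol

class FeatureSpec(Protocol):
    """Hypothetical minimal pipeline interface."""
    name: str
    def pages(self) -> List[str]: ...

class ModelSpecSketch:
    """Flat field list; renders to a single page."""
    def __init__(self, name: str, fields: List[str]):
        self.name = name
        self.fields = fields
    def pages(self) -> List[str]:
        return [f"{self.name}.md"]

class UnionSpecSketch:
    """Variant-annotated fields; renders the union page plus variant pages."""
    def __init__(self, name: str, variants: List[str]):
        self.name = name
        self.variants = variants
    def pages(self) -> List[str]:
        return [f"{self.name}.md"] + [f"{v}.md" for v in self.variants]

def render(spec: FeatureSpec) -> List[str]:
    # Structural typing: both spec classes satisfy the protocol
    # without inheriting from a common base.
    return spec.pages()
```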

Schema root computed from all entry points, before theme filtering. Output directory
structure must remain stable regardless of which themes are selected. Computing the root
from filtered paths would shift directories when themes change.

Constraint provenance via ConstraintSource. Each constraint records which NewType
contributed it. Field-level constraints with source=None render on the field;
constraints with a named source render on the NewType's own page. This prevents
duplication.
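The rendering rule in miniature. This is a stripped-down sketch: the real ConstraintSource also carries an object reference alongside the name, and the filter functions are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Constraint:
    text: str
    source_name: Optional[str]  # None: declared directly on the field

def field_page_lines(constraints):
    """Only unsourced constraints render in the field's table cell."""
    return [c.text for c in constraints if c.source_name is None]

def newtype_page_lines(constraints, newtype_name):
    """Sourced constraints render once, on the contributing NewType's page."""
    return [c.text for c in constraints if c.source_name == newtype_name]
```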

Test plan

  • make check passes (pytest + doctests + ruff + mypy): 2,111 tests
  • make install && python -m overture.schema.codegen generate --format markdown --output-dir /tmp/schema-docs produces output
  • Spot-check generated markdown for a union feature (e.g., Segment) and a model
    feature (e.g., Building) -- field tables, links, constraint descriptions, examples
  • Verify cross-page links resolve correctly (supplementary types link back to
    features, features link to shared types)

pytest-subtests merged into pytest core as of pytest 9.
Update test imports from pytest_subtests.SubTests to
_pytest.subtests.Subtests.
- Add -q, --tb=short to `make test` for compact output
- Set verbosity_subtests=0 to suppress per-subtest
  progress characters (the u/,/- markers from pytest's
  built-in subtests support)

Bare triple-quoted strings after NewType assignments are
expression statements that Python never attaches to the
NewType object, leaving __doc__ as None. Convert each to
an explicit __doc__ assignment so codegen and introspection
tools can read them at runtime.

Same pattern DocumentedEnum uses for enum member docs.
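The fix in miniature, with illustrative type names:

```python
from typing import NewType

# Before: a bare string after the assignment is an expression statement;
# Python evaluates and discards it, so nothing attaches to the NewType.
Confidence = NewType("Confidence", float)
"""A confidence score between 0 and 1."""  # silently discarded

# After: an explicit __doc__ assignment is introspectable at runtime.
Wikidata = NewType("Wikidata", str)
Wikidata.__doc__ = "A Wikidata item identifier."
```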

OvertureFeature validator error message had two continuation
lines missing the f-prefix, so {self.__class__.__name__} was
rendered literally. Also add missing space before "and".

Replace hardcoded discriminator_fields tuple ("type", "theme",
"subtype") in _process_union_member with the discriminator field
name extracted from the union's Annotated metadata.

introspect_union already extracted the discriminator field name
but didn't pass it through to member processing. Now it does,
so unions using any field name as discriminator work correctly.

For nested unions, parent discriminator values are extracted from
nested leaf models to preserve structural tuple classification.

Feature.field_discriminator now attaches _field_name to the
callable, and _extract_discriminator_name reads it. This handles
the Discriminator-wrapping-a-callable case that str(disc) got
wrong silently.

Make _extract_literal_value return str directly instead of object,
eliminating implicit str() conversions at call sites. Add comment
explaining nested union re-indexing under the parent discriminator.

Remove redundant test covered by TestDiscriminatorDiscovery and
debugging print() calls from TestStructuralTuples.

The field holds the entry point value in "module:Class" format, not a
class name. The old name required callers to know this (codegen's cli.py
had a comment explaining it, and assigned to a local `entry_point`
variable to compensate).

Empty package with build config, namespace packages, and
py.typed marker. Declares click, jinja2, tomli, and
overture-schema-core/system as dependencies.

Type analyzer (analyze_type) handles all type unwrapping in a
single iterative function: NewType → Annotated → Union → list →
terminal classification. Constraints accumulate from Annotated
metadata with source tracking via ConstraintSource.

Data structures: TypeInfo (type representation), FieldSpec
(model field), ModelSpec (model), EnumSpec, NewTypeSpec,
PrimitiveSpec.

Type registry maps type names to per-target string
representations via TypeMapping. is_semantic_newtype()
distinguishes meaningful NewTypes from pass-through aliases.

Utilities: case_conversion (snake_case), docstring (cleaning
and custom-docstring detection).
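One plausible heuristic for `is_semantic_newtype()`, sketched here purely for illustration (the real predicate may use entirely different criteria):

```python
from typing import NewType

def is_semantic_newtype(nt: object) -> bool:
    """Hypothetical heuristic: treat a NewType as semantic when it carries
    its own docstring rather than typing.NewType's default one."""
    doc = getattr(nt, "__doc__", None)
    return doc is not None and doc != NewType.__doc__
```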

Domain-specific extractors that consume analyze_type() and
produce specs:

- model_extraction: extract_model() for Pydantic models with
  MRO-aware field ordering, alias resolution, and recursive
  sub-model expansion via expand_model_tree()
- enum_extraction: extract_enum() for DocumentedEnum classes
- newtype_extraction: extract_newtype() for semantic NewTypes
- primitive_extraction: extract_primitives() for numeric types
  with range and precision introspection
- union_extraction: extract_union() with field merging across
  discriminated union variants

Shared test fixtures in codegen_test_support.py.

Generate prose from extracted constraint data:

- field_constraint_description: describe field-level
  constraints (ranges, patterns, unique items, hex colors)
  as human-readable notes with NewType source attribution
- model_constraint_description: describe model-level
  constraints (@require_any_of, @radio_group, @min_fields_set,
  @require_if, @forbid_if) as prose, with consolidation of
  same-field conditional constraints

Determine what artifacts to generate and where they go:

- module_layout: compute output directories for entry points,
  map Python module paths to filesystem output paths via
  compute_output_dir
- path_assignment: build_placement_registry maps types to
  output file paths. Feature models get {theme}/{slug}/,
  shared types get types/{subsystem}/, theme-local types
  nest under their feature or sit flat at theme level
- type_collection: discover supplementary types (enums,
  NewTypes, sub-models) by walking expanded feature trees
- link_computation: relative_link() computes cross-page
  links, LinkContext holds page path + registry for
  resolving links during rendering

Embed JSON example features in [tool.overture-schema.examples]
sections. Each example is a complete GeoJSON Feature matching
the theme's Pydantic model, used by the codegen example_loader
to render example tables in documentation.

Jinja2 templates and rendering logic for documentation pages:

- markdown_renderer: orchestrates page rendering for features,
  enums, NewTypes, primitives, and geometry. Recursively expands
  MODEL-kind fields inline with dot-notation.
- markdown_type_format: type string formatting with link-aware
  rendering via LinkContext
- example_loader: loads examples from theme pyproject.toml,
  validates against Pydantic models, flattens to dot-notation
- reverse_references: computes "Used By" cross-references
  between types and the features that reference them

Templates: feature, enum, newtype, primitives, geometry pages.
Golden-file snapshot tests verify rendered output stability.

Adds renderer-specific fixtures to conftest.py (cli_runner,
primitives_markdown, geometry_markdown).

Click-based CLI entry point (overture-codegen generate) that
wires discovery → extraction → output layout → rendering:

- Discovers models via discover_models() entry points
- Filters themes, extracts specs, builds placement registry
- Renders markdown pages with field tables, examples, cross-
  references, and sidebar metadata
- Supports --theme filtering and --output-dir targeting

Integration tests verify extraction against real Overture
models (Building, Division, Segment, etc.) to catch schema
drift. CLI tests verify end-to-end generation, output
structure, and link integrity.

Design doc covers the four-layer architecture, analyze_type(),
domain-specific extractors, and extension points for new output
targets.

Walkthrough traces Segment through the full pipeline
module-by-module in dependency order, with FeatureVersion as a
secondary example for constraint provenance in the type analyzer.

README describes the problem (Pydantic flattens domain vocabulary),
the "unwrap once, render many" approach, CLI usage, architecture
overview, and programmatic API.

TypeInfo.literal_value discarded multi-value Literals entirely
(Literal["a", "b"] got None). Renamed to literal_values as a
tuple of all args so consumers decide presentation.

single_literal_value() preserves its contract: returns the
value for single-arg Literals, None otherwise. Callers
(example_loader, union_extraction) are unchanged.

Multi-value Literals render as pipe-separated quoted values
in markdown tables: `"a"` \| `"b"`.
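Sketched as standalone helpers (the real versions hang off TypeInfo and the markdown formatter; names are simplified):

```python
from typing import Literal, get_args

def literal_values(tp) -> tuple:
    """All Literal args as a tuple; consumers decide presentation."""
    return get_args(tp)

def single_literal_value(tp):
    """Preserved contract: the value for single-arg Literals, None otherwise."""
    args = get_args(tp)
    return args[0] if len(args) == 1 else None

def markdown_literal_cell(tp) -> str:
    """Pipe-separated quoted values, with the pipe escaped for table cells."""
    return " \\| ".join(f'`"{v}"`' for v in get_args(tp))
```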

Replace TypeInfo.is_list: bool with list_depth: int so nested lists
like list[NewType("Hierarchy", list[HierarchyItem])] are handled
correctly. analyze_type increments list_depth for each list[...]
layer instead of setting a boolean. An is_list property preserves
the boolean API for depth-unaware consumers.

Markdown renderer: format_type and format_underlying_type wrap
list_depth times. _expandable_list_suffix returns "[]" per nesting
level for dot-notation expansion. Constraint annotation matching
strips all trailing "[]" suffixes instead of one.

Union extraction: _type_identity uses list_depth (int) instead of
is_list (bool) so fields with different nesting depths don't
incorrectly deduplicate.

Update design doc and walkthrough to reflect list_depth replacing
the is_list boolean throughout TypeInfo, _UnwrapState, type
formatting, and union deduplication.
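The depth-counting idea, reduced to a standalone sketch (analyze_type tracks this incrementally rather than in a separate pass; `is_list` survives as a derived property equal to `list_depth > 0`):

```python
from typing import NewType, get_args, get_origin

def list_depth(tp) -> int:
    """Count nested list[...] layers, peeling NewTypes transparently."""
    depth = 0
    while True:
        if hasattr(tp, "__supertype__"):   # NewType wrapper: look through it
            tp = tp.__supertype__
        elif get_origin(tp) is list:
            depth += 1
            tp = get_args(tp)[0]
        else:
            return depth

def format_type(name: str, depth: int) -> str:
    """Wrap list_depth times instead of consulting a single boolean."""
    for _ in range(depth):
        name = f"list<{name}>"
    return name
```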
@RoelBollens-TomTom
Collaborator

Some observations without having looked into the code:

There are two different representations for a list in the markdown generation, e.g.:

| Name | Type | Description |
| --- | --- | --- |
| emails | list&lt;EmailStr&gt; (optional) | The email addresses of the place.<br/>minimum length: 1<br/>Ensures all items in a collection are unique. (UniqueItemsConstraint) |
| phones | PhoneNumber (list, optional) | The phone numbers of the place.<br/>minimum length: 1<br/>Ensures all items in a collection are unique. (UniqueItemsConstraint) |

In Places, the referenced Address type points to the Address type of the addresses theme (../addresses/address.md), which is incorrect:

| Name | Type | Description |
| --- | --- | --- |
| addresses[] | list&lt;Address&gt; (optional) | The address or addresses of the place<br/>minimum length: 1 |

And something very minor: when Pydantic types such as EmailStr or HttpUrl are used, there won't be a reference:

| Name | Type | Description |
| --- | --- | --- |
| emails | list&lt;EmailStr&gt; (optional) | The email addresses of the place.<br/>minimum length: 1<br/>Ensures all items in a collection are unique. (UniqueItemsConstraint) |
| websites | list&lt;HttpUrl&gt; (optional) | The websites of the place.<br/>minimum length: 1<br/>Ensures all items in a collection are unique. (UniqueItemsConstraint) |

Contributor


I'm probably missing quite a bit (there is a lot of code here), but I like the structure + overall design and the generated markdown looks solid, so I'm approving.

Commented on a few small issues in addition to the aforementioned list representation confusion.

Collaborator Author

Seth Fitzsimmons (mojodna) commented Mar 3, 2026

Roel Bollens (@RoelBollens-TomTom) good finds. I'm working on fixes for the incorrect Address reference (which results from name collisions) and creating pages + links for the Pydantic types.

The reason for the 2 different list representations is NewTypes that wrap list (and annotate with constraints) vs. vanilla lists. If we were to display PhoneNumber as list<T>, we'd lose the ability to link to PhoneNumber (and display its docstring, constraints, and references in more detail). We could render list<EmailStr> as "EmailStr (list, optional)", but then we'd be treating PhoneNumber and EmailStr as the same thing even though one is a list, the other is scalar.

Roel Bollens (@RoelBollens-TomTom) and Adam Lastowka (@Rachmanin0xFF) suggestions for making this less confusing?

@Rachmanin0xFF
Contributor

> Roel Bollens (@RoelBollens-TomTom) and Adam Lastowka (@Rachmanin0xFF) suggestions for making this less confusing?

Maybe just a comment at https://github.com/OvertureMaps/schema/pull/451/changes#diff-d3543f3c56213c5ae4cf72e240b850d5cf763f9ceb7d2a9f9cf78c7602075739R110 would be fine.

Replace bare class name keys with TypeIdentity objects across all
registries. Two types with the same __name__ from different modules
(e.g., Places Address vs Addresses Address) now get separate registry
entries and resolve to different output paths.

TypeIdentity is a frozen dataclass pairing a unique Python object
(class, NewType callable, or union annotation) with its display name.
Equality and hashing delegate to object identity so lookups are
collision-free regardless of display name.

Changes across the pipeline:
- ConstraintSource stores source_ref (NewType callable) and
  source_name instead of a bare name string
- type_collection, path_assignment, link_computation, and
  reverse_references all key on TypeIdentity
- primitive_extraction returns TypeIdentity instead of strings
- Renderers construct TypeIdentity for link resolution
- Each spec type exposes an identity property via
  _SourceTypeIdentityMixin (or directly for UnionSpec)
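The identity-hashing idea can be sketched like this (field names follow the commit message; the rest is a simplified reconstruction):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True, eq=False)
class TypeIdentity:
    """Pairs a unique Python object (class, NewType callable, or union
    annotation) with its display name. Equality and hashing delegate to
    object identity, so same-named types stay distinct."""
    obj: Any
    display_name: str

    def __eq__(self, other) -> bool:
        return isinstance(other, TypeIdentity) and self.obj is other.obj

    def __hash__(self) -> int:
        return id(self.obj)

    @property
    def module(self) -> str:
        return getattr(self.obj, "__module__", "")
```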
MinLen/MaxLen: render as prose ("Minimum length: 1") instead of
wrapping the entire phrase in backticks. Math notation (≥, <) stays
in backticks; English words don't belong there.

UniqueItemsConstraint: reword docstring from class-description
phrasing ("Ensures all items in a collection are unique") to
validation-requirement phrasing ("All items must be unique"),
matching model-level constraint tone.

String constraints: normalize PhoneNumberConstraint,
RegionCodeConstraint, and WikidataIdConstraint docstrings to the
"Allows only..." pattern used by all other StringConstraint
subclasses.

Pydantic types like HttpUrl and EmailStr appear in field annotations
but previously rendered as unlinked inline code. Each referenced
Pydantic type now gets its own page under pydantic/<module>/ with a
description, upstream Pydantic docs link, and Used By section.

Discovery is reference-driven: the type collection visitor detects
PRIMITIVE-kind types from pydantic modules in expanded feature trees.
PydanticTypeSpec joins the SupplementarySpec union and flows through
placement, reverse references, and rendering.

Linking is registry-driven for all PRIMITIVE-kind types. Any primitive
with a page in the placement registry gets linked, whether it's a
Pydantic type (individual page) or a registered numeric primitive
(aggregate page). This also links int32/float64 to the primitives
page, which they weren't before.

Shared is_pydantic_sourced() predicate gates collection and reverse
reference tracking to pydantic-origin types without restricting the
linking mechanism.

Remove bbox from default skip keys so it renders in
example output like any other field.

After resolving type name collisions across themes (101596f),
two referrers from different modules can share a display name.
The sort key (kind, name) produced ties, and Python's sorted()
preserved set iteration order for tied elements -- which depends
on id()-based hashing and varies across process invocations.

Add the source module as a tiebreaker: (kind, name, module).
Expose TypeIdentity.module property to encapsulate the
getattr(obj, "__module__") access pattern.
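The determinism fix, demonstrated with plain tuples standing in for the real TypeIdentity-keyed referrers:

```python
def referrer_sort_key(ref: tuple) -> tuple:
    """Sort key for 'Used By' listings: (kind, name, module).
    kind and name alone tie for same-named types from different modules;
    module makes the order stable across process invocations."""
    kind, name, module = ref
    return (kind, name, module)

# Set iteration order varies per process, but the module tiebreaker
# makes sorted() deterministic regardless.
referrers = {
    ("model", "Address", "overture.schema.places"),
    ("model", "Address", "overture.schema.addresses"),
}
ordered = sorted(referrers, key=referrer_sort_key)
```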
Constraint annotations in table description cells ran directly
into the preceding description text with only a single <br/>.
Double the break so constraints read as a separate paragraph.

list[PhoneNumber] rendered as "PhoneNumber (list)" — implying
PhoneNumber itself is a list type. The root cause: format_type
couldn't distinguish list layers outside a NewType from list layers
inside one.

Add newtype_outer_list_depth to TypeInfo, snapshotted from list_depth
when the type analyzer enters the first NewType. The renderer uses
this to choose list<X> syntax (list wraps the NewType) vs a (list)
qualifier (NewType wraps a list internally). Non-NewType identities
(enums, models) continue using list<X>.
@mojodna
Collaborator Author

I didn't understand the list rendering issue that both Roel and Adam flagged (because I was blind to it in the Markdown output), but after seeing it this morning, I fixed it in 48df0ab (part of this PR).

_truncate() produced strings up to 103 chars (100 + "..."). Account
for the 3-char ellipsis so output stays within the 100-char limit.
str() on string list items renders as [a, b], indistinguishable from
bare identifiers. repr() renders as ['a', 'b'] so strings are
visually distinct from numbers.

extract_model() on union members produced ModelSpecs with model=None
on MODEL-kind fields. _collect_from_fields then hit the RuntimeError
guard when it encountered those unexpanded references. Call
expand_model_tree() on each member before walking its fields.

No current union members have sub-model fields, so this was latent.
@vcschapp
Collaborator

One comment about the Description - I haven't got to the code yet.

The first five lines of the Architecture section with its four layers are a super nice compression of info. "Context compaction" to coin a phrase. It primed me to find an explicit hierarchical organization into those 4 layers. Then when I get to What's in the package, it's a big ol' viewport-filling 22-row table of flat filenames, which is also in the diff.

Would it make sense to group the files into directories/modules based on the layer they belong to? It'd definitely help with the job of climbing up and down the abstraction ladder.

flatten_example recursed into all dicts, splitting dict-typed fields
like `tags: dict[str, str]` into dot-notation rows. Now
collect_dict_paths walks the FieldSpec tree to identify dict-typed
field paths, and _flatten_value checks membership before recursing.

Indexed runtime paths (items[0].tags) are normalized to schema
notation (items[].tags) for matching. The pipeline computes
dict_paths from spec.fields and threads them through load_examples.

Also: clarify mutual exclusion in type visitor elif chains
(reverse_references, type_collection) and rename _TypeIdentity to
_TypeShape in union_extraction to avoid shadowing specs.TypeIdentity.
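The path normalization step can be sketched as a pair of helpers (function names are illustrative; the membership check mirrors what _flatten_value does before recursing):

```python
import re

_INDEX = re.compile(r"\[\d+\]")

def normalize_path(runtime_path: str) -> str:
    """Indexed runtime paths (items[0].tags) to schema notation (items[].tags)."""
    return _INDEX.sub("[]", runtime_path)

def is_dict_path(runtime_path: str, dict_paths: set) -> bool:
    """Membership gate: stop flattening when a path lands on a dict-typed field."""
    return normalize_path(runtime_path) in dict_paths
```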
Move modules into three sub-packages matching the architecture layers:

- extraction/ (14 modules): type analysis, specs, extractors, constraints
- layout/ (2 modules): module layout, type collection
- markdown/ (6 modules + templates): pipeline, renderer, type formatting,
  links, paths, reverse references

Three modules renamed to drop redundant prefixes:
  field_constraint_description → extraction/field_constraints
  model_constraint_description → extraction/model_constraints
  example_loader → extraction/examples

Templates flattened from templates/markdown/ to markdown/templates/.
@mojodna
Collaborator Author

I'd sketched a reorganization that also included extracting the Markdown generator (and using, guess what, entry points to register codegen targets) into separate packages and was planning on discussing that later. I pulled the split by layer into 1132e48.
