Write generated docs as UTF-8 in typer ... utils docs --output #1881

Open
Sreekant13 wants to merge 2 commits into fastapi:master from Sreekant13:fix/docs-output-utf8
Sreekant13 commented Jul 3, 2026


Discussion: #1882

Description

typer <app> utils docs --output FILE writes the generated Markdown with Path.write_text(clean_docs), which uses the platform's default encoding. When the CLI's help contains non-ASCII characters (emojis are common in Typer/Rich apps) this raises UnicodeEncodeError on interpreters whose locale encoding isn't UTF-8 (for example cp1252 on Windows).

Reproduction:

# emoji_app.py
import typer

app = typer.Typer()


@app.command()
def hello(name: str):
    """Say hello 👋 to someone."""

$ typer emoji_app utils docs --output out.md
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f44b' ...

This PR writes the file as UTF-8, which matches how the tests read the docs back (read_text(encoding="utf-8")).

I also added a regression test that forces a non-UTF-8 locale (LC_ALL=C, PYTHONUTF8=0) so it fails on the old behavior on any platform, not just Windows.
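
To illustrate the failure and the fix outside of Typer, here is a minimal standard-library sketch. It is not the PR's actual diff: the cp1252 default is simulated by naming the codec explicitly, and the `docs`, `out`, `legacy_error`, and `round_trip` names are made up for this example.

```python
import tempfile
from pathlib import Path

# Stand-in for the generated Markdown; the emoji is the non-ASCII content.
docs = "Say hello 👋 to someone.\n"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out.md"

    # Simulate a platform whose default encoding is cp1252 by passing the
    # codec explicitly. cp1252 has no mapping for the emoji, so this raises.
    try:
        out.write_text(docs, encoding="cp1252")
        legacy_error = None
    except UnicodeEncodeError as exc:
        legacy_error = exc

    # The approach described above: always encode the file as UTF-8,
    # and read it back the same way.
    out.write_text(docs, encoding="utf-8")
    round_trip = out.read_text(encoding="utf-8")

print(type(legacy_error).__name__)
print(round_trip == docs)
```

Passing the encoding explicitly makes the result independent of the locale, which is why the same write succeeds on every platform.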

`typer <app> utils docs --output FILE` wrote the Markdown file using the
platform's default encoding, so non-ASCII help (for example emojis, which are
common in Typer/Rich CLIs) raised UnicodeEncodeError on interpreters where the
locale encoding is not UTF-8, such as cp1252 on Windows.

Write the file as UTF-8, matching how the docs are read back in the tests, and
add a regression test that forces a non-UTF-8 locale so it fails on the old
behavior on any platform.

@Sreekant13 (Author)

Heads up: check-labels is red only because no category label is set yet. This is a bug fix, so it'd be bug. I can't add labels myself.

phalberg commented Jul 4, 2026

Contributor

Hey, thanks for contributing! A few notes, even though I'm not a maintainer:

  • Some of the existing tests still call read_text() without encoding="utf-8". Changing test_doc_output and test_doc_title_output in this same PR could be considered for consistency.
  • In my opinion the test now covers two separate concerns: checking emoji support, and forcing a non-UTF-8 environment while UTF-8 is still used under the hood. Splitting it into two tests could also be considered.
  • You are also forcing quite a lot of things in the environment. PYTHONUTF8=0 is probably enough, and I could imagine this test case being hard to maintain in the future if the environment is this "artificial".

These are just some thoughts I had while looking at the PR.

LC_ALL=C already overrides LANG, so LANG=C was redundant. Keep LC_ALL=C
(forces a non-UTF-8 locale) and PYTHONUTF8=0 (keeps it non-UTF-8 on 3.15+
where UTF-8 mode is on by default), with a comment explaining why each is
needed for the test to fail on the old behavior on any platform.
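
As a rough way to see what these two variables do to a child interpreter, the sketch below launches Python with both set and reports its UTF-8 mode flag and preferred encoding. This is an illustration under assumptions, not the PR's test: `run_in_forced_locale` is a hypothetical helper, and the encoding name the child reports depends on the platform, so it is printed rather than assumed.

```python
import os
import subprocess
import sys


def run_in_forced_locale(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with the two variables forced.

    Hypothetical helper for this sketch, not part of Typer or the PR.
    """
    env = {**os.environ, "LC_ALL": "C", "PYTHONUTF8": "0"}
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )


probe = run_in_forced_locale(
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)

# First field: UTF-8 mode flag. Second field: the platform-dependent
# encoding that open() and Path.write_text() would use by default.
utf8_mode, preferred = probe.stdout.split(maxsplit=1)
preferred = preferred.strip()
print(f"utf8_mode={utf8_mode} preferred_encoding={preferred}")
```

Running the probe in a subprocess matters because the encoding defaults are fixed at interpreter startup, so changing the environment inside an already running process would not affect them.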

Sreekant13 commented Jul 4, 2026

Author

Thanks for the review, @phalberg!

  • On the env vars: good call on LANG=C. LC_ALL already overrides it, so I have dropped it (just pushed). I kept LC_ALL=C and PYTHONUTF8=0 though: PYTHONUTF8=0 on its own isn't enough, because most CI locales are already UTF-8, so getpreferredencoding() stays UTF-8 with UTF-8 mode off and the old code wouldn't fail there. LC_ALL=C is what forces a non-UTF-8 preferred encoding. I keep PYTHONUTF8=0 too so it still holds on Python 3.15+, where UTF-8 mode is on by default (PEP 686). Added a comment explaining that.

  • On splitting the test: I kept it as one on purpose. The regression only shows up when both conditions hold together: non-ASCII content and a non-UTF-8 locale. Emoji content alone passes on the old code in a UTF-8 environment, and a non-UTF-8 locale with ASCII content passes too, so splitting them would mean neither half fails without the fix. Happy to restructure if you have a split in mind that still guards the regression.

  • On updating the older tests to read_text(encoding="utf-8"): makes sense for consistency. I left them alone to keep this PR focused on the fix, but I am glad to include that here if you and the maintainers would prefer.
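
The argument about keeping a single test can be sketched as a small truth table. This is only an illustration: `can_encode` is a hypothetical helper, and "ascii" stands in for whatever non-UTF-8 encoding the forced locale yields.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Content (ASCII or emoji) crossed with encoding (UTF-8 or non-UTF-8).
results = {
    ("ascii text", "utf-8"): can_encode("Say hello", "utf-8"),
    ("ascii text", "ascii"): can_encode("Say hello", "ascii"),
    ("emoji text", "utf-8"): can_encode("Say hello 👋", "utf-8"),
    ("emoji text", "ascii"): can_encode("Say hello 👋", "ascii"),
}

# Only the combination of non-ASCII content and a non-UTF-8 encoding fails.
failing = [case for case, ok in results.items() if not ok]
print(failing)
```

Each condition on its own still encodes cleanly, which is the reason given above for not splitting the regression test in two.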

Thanks again!
