
autogen-studio: pin utf-8 encoding on production text-file open() calls (refs #5566) #7723

Open
adv0r wants to merge 1 commit into microsoft:main from adv0r:tokenburn/autogen-5566-studio


adv0r commented May 20, 2026

Dear maintainer — this PR has a permanent home with methodology + opt-out at tokens-for-good. A one-line "no thanks" → auto-close + blacklist. Sorry for the notification this edit caused.


Why

Refs #5566. Continuation of the encoding sweep started in #6094
(playwright_controller.py) and continued in
the magentic-one-cli PR.

The original report explicitly flagged that "there will be some
similar issues in the codebase while using open function". On a
non-UTF-8 default locale (cp950 on Traditional Chinese Windows, cp1252
on Western European Windows, …), Python's open(..., "r") falls back
to the platform encoding and can raise UnicodeDecodeError, or silently
mis-decode the text, when the file contains non-ASCII bytes.
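
The failure mode described above can be reproduced on any machine with a minimal sketch. It assumes a cp1252 default can be simulated by naming that codec explicitly; the file name and sample text are made up for illustration and do not come from the autogen-studio code base.

```python
import os
import tempfile

# Sample text is illustrative only; "Á" encodes to bytes 0xC3 0x81 in UTF-8.
text = "Ángel café"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Simulate a cp1252 platform default by naming the codec explicitly.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        # Byte 0x81 has no mapping in Python's cp1252 codec.
        cp1252_failed = True

    # Pinning UTF-8 reads the same file correctly regardless of locale.
    with open(path, "r", encoding="utf-8") as f:
        roundtrip = f.read()

# When every byte happens to be mapped, decoding succeeds but the text is wrong.
mojibake = "café".encode("utf-8").decode("cp1252")
print(cp1252_failed, roundtrip == text, mojibake)
```

The last line illustrates why an explicit encoding matters even when nothing crashes: the read succeeds but returns mangled text.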

This PR closes the autogen-studio production code paths that read
or write text files without an explicit encoding.

What changed

5 files, 11 open() call sites, all of the same shape:

    - with open(path, "r") as f:
    + with open(path, "r", encoding="utf-8") as f:

| File | Sites |
|------|-------|
| autogenstudio/cli.py | 1 (writes runtime .env) |
| autogenstudio/lite/studio.py | 1 (writes runtime .env) |
| autogenstudio/database/schema_manager.py | 6 (Alembic env.py / script.py.mako / alembic.ini) |
| autogenstudio/web/auth/manager.py | 1 (user-supplied YAML config) |
| autogenstudio/gallery/builder.py | 1 (writes gallery_default.json) |

Why these specific sites

  • Alembic templates and env.py can legitimately contain non-ASCII
    comments / paths and are read+rewritten on schema upgrades.
  • .env writers are called with the user's project path. Folder
    names with accented characters (very common on Windows) would crash
    the first run.
  • YAML auth config is user-supplied.
  • gallery_default.json can contain non-ASCII strings.
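
For the .env writers, the pinned-encoding pattern might look like the sketch below. The helpers `write_env_file` and `read_env_file`, the `EXAMPLE_APPDIR` key, and the folder name are all hypothetical and are not taken from `cli.py` or `lite/studio.py`; the point is only that a path containing a character the platform code page cannot represent round-trips once the encoding is explicit.

```python
import os
import tempfile


def write_env_file(path: str, values: dict) -> None:
    """Write KEY=VALUE lines with a pinned encoding (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for key, value in values.items():
            f.write(f"{key}={value}\n")


def read_env_file(path: str) -> dict:
    """Read the file back with the same pinned encoding (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        pairs = (line.rstrip("\n").split("=", 1) for line in f if "=" in line)
        return {key: value for key, value in pairs}


# EXAMPLE_APPDIR and the folder name are made up for illustration.
values = {"EXAMPLE_APPDIR": "C:/Users/José/專案"}

with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    write_env_file(env_path, values)
    loaded = read_env_file(env_path)

# Without the pin, a cp1252 default could not encode the CJK part of the path.
try:
    values["EXAMPLE_APPDIR"].encode("cp1252")
    cp1252_can_encode = True
except UnicodeEncodeError:
    cp1252_can_encode = False

print(loaded == values, cp1252_can_encode)
```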

Scope deliberately narrowed

  • Production-code only. Test fixtures left out for a separate PR.
  • Skipped aiofiles.open in teammanager.py — async API signature
    is slightly different, deserves its own audited PR.
  • Did NOT sweep agbench/benchmarks/* — those are scenario scripts
    that consume JSONL produced by other agents; forcing UTF-8 there
    could mask issues upstream.

Verification

  • ast.parse(...) clean on all 5 touched files (no syntax break).
  • No encoding=...encoding= (double-add) anywhere.
  • 11 insertions, 11 deletions (same-line edits only).
  • No behaviour change for users already on UTF-8 locales.

Recommended next sweep (for a separate PR)

  • agbench (mixed: some files read agent-emitted JSONL, others are
    user scripts — needs case-by-case audit)
  • async aiofiles.open sites
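
A follow-up sweep could be scoped with a small `ast`-based audit such as the sketch below. This is not the tooling used for this PR; `find_unpinned_open_calls` is a hypothetical helper that only recognises the bare built-in `open(...)` name, so `aiofiles.open`, `Path.read_text` and aliased calls would need separate handling.

```python
import ast


def find_unpinned_open_calls(source: str) -> list:
    """Return line numbers of text-mode open() calls that lack an encoding."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if not (isinstance(node.func, ast.Name) and node.func.id == "open"):
            continue
        # encoding is the 4th positional parameter of open(), or a keyword.
        if len(node.args) >= 4:
            continue
        if any(kw.arg == "encoding" for kw in node.keywords):
            continue
        # Mode is the 2nd positional argument or the mode= keyword.
        mode = node.args[1] if len(node.args) >= 2 else None
        for kw in node.keywords:
            if kw.arg == "mode":
                mode = kw.value
        if (
            isinstance(mode, ast.Constant)
            and isinstance(mode.value, str)
            and "b" in mode.value
        ):
            continue  # binary mode does not take an encoding
        hits.append(node.lineno)
    return sorted(hits)


sample = '''
with open(path, "r") as f:
    pass
with open(path, "r", encoding="utf-8") as f:
    pass
with open(path, "rb") as f:
    pass
with open(path) as f:
    pass
'''

unpinned = find_unpinned_open_calls(sample)
print(unpinned)
```

Each reported line still needs a human decision, since some files (such as the agent-emitted JSONL mentioned above) may legitimately need a different treatment.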

AI-assisted via Cursor (Claude Opus 4.7). Personal token-burn
initiative by @adv0r to use up an expiring Cursor subscription budget on
small, useful upstream contributions.

Made with Cursor

Refs microsoft#5566.

Continuation of the same encoding sweep started in microsoft#6094 (which fixed
the original `playwright_controller.py` site) and continued in the
`magentic-one-cli` PR. The reporter of microsoft#5566 explicitly flagged that
*"there will be some similar issues in the codebase while using open
function"* — this PR closes the autogen-studio production code paths
that read or write text files without specifying an encoding.

On a non-UTF-8 default locale (e.g. cp950 on Traditional Chinese
Windows, cp1252 on Western European Windows), Python's `open(..., "r")`
falls back to the platform encoding and can raise `UnicodeDecodeError`,
or silently mis-decode the text, on non-ASCII bytes. For autogen-studio
that can happen whenever:

- `schema_manager.py` reads or writes Alembic templates (`env.py`,
  `script.py.mako`, `alembic.ini`) that may contain non-ASCII paths or
  comments
- `cli.py` / `lite/studio.py` write the runtime `.env` file (project
  paths can contain user/folder names with accented characters)
- `web/auth/manager.py` loads a user-supplied YAML config
- `gallery/builder.py` writes `gallery_default.json`

Files touched (11 lines, 5 files):

| File | open() sites fixed |
|------|--------------------|
| autogenstudio/cli.py | 1 |
| autogenstudio/lite/studio.py | 1 |
| autogenstudio/database/schema_manager.py | 6 |
| autogenstudio/web/auth/manager.py | 1 |
| autogenstudio/gallery/builder.py | 1 |

For every site the change is the same shape:

```diff
- with open(path, "r") as f:
+ with open(path, "r", encoding="utf-8") as f:
```

Scope deliberately narrowed:

- **Production-code only** — no test fixtures.
- **Skipped `aiofiles.open` in `teammanager.py`** — the API is slightly
  different and that one deserves its own audited PR.
- **Did NOT sweep `agbench/benchmarks/*`** — those are user-facing
  scenario scripts that read JSONL produced by other agents; forcing
  UTF-8 there could mask issues upstream.

No behaviour change for already-UTF-8-locale users (UTF-8 IS what
Python opens these as on macOS/Linux today). All five files re-parsed
cleanly via `ast.parse(...)` after the rewrite.

AI-assisted via Cursor (Claude Opus 4.7). Personal token-burn
initiative by @adv0r to use up an expiring Cursor subscription budget
on small, useful upstream contributions.

Co-authored-by: Cursor <cursoragent@cursor.com>

adv0r commented May 20, 2026

Heads up: the CLA reply needs to come from the human account holder (@adv0r) directly, which I can't auto-post on their behalf in good conscience — the magic-phrase reply is a binding legal acceptance. I've flagged it on the user's side as a manual TODO and the CLA acceptance should land here shortly. The companion PR #7722 already shows license/cla: SUCCESS, so this looks like a per-PR re-acknowledgment after the initial sign.


adv0r commented May 28, 2026

@microsoft-github-policy-service agree
