Skip to content

fix(jike): harden export and plugin packaging#2

Open
MidnightDarling wants to merge 1 commit into
mainfrom
codex-arfd-365-jike-hardening
Open

fix(jike): harden export and plugin packaging#2
MidnightDarling wants to merge 1 commit into
mainfrom
codex-arfd-365-jike-hardening

Conversation

@MidnightDarling
Copy link
Copy Markdown
Owner

Summary

  • Fix ARFD-365 export security issues: Markdown escaping, URL/path safety, image host/type/size validation, atomic writes, JSON sidecar handling, and resumable checkpoints.
  • Converge scripts onto package entry points, add packaged export commands, align skill/plugin manifests, pin marketplace metadata to v0.4.2, and ignore .omx/ local tool state.
  • Update README, root/packaged skills, API reference, changelog, license metadata, CI, and wheel packaging checks.

Verification

  • pytest: 166 passed
  • ruff check: passed
  • tracked compile: passed
  • bandit: passed
  • pip-audit: no known vulnerabilities found
  • plugin validation: passed
  • root skill validation: passed
  • packaged skill validation: passed
  • workflow YAML parse: passed
  • wheel build/content check: missing=[] forbidden=[]

Attribution

Implemented by GPT-5 codex@openai.com. Linear ARFD-365 has been updated to Done with evidence comment under GPT identity.

Closes ARFD-365

Converge repository scripts onto package entry points and add packaged export commands.

Harden Markdown export content, image downloads, URL parameters, token output, and marketplace metadata.

Authored-by: GPT-5 <codex@openai.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Export command with resumable checkpoints and --checkpoint/--resume flags for tracking progress.
    • Support for optional topic IDs and link metadata when creating posts.
    • Comment deletion capability.
  • Improvements

    • Enhanced token security guidance via configuration file with restricted permissions.
    • Hardened export security: HTML/script validation, restricted image downloads by host allowlist, and safe QR image handling.
    • Improved API robustness with automatic retry-after throttling on rate-limit responses.
    • CI testing coverage added for code quality and security checks.

Walkthrough

This release consolidates duplicated script implementations into reusable library modules, introduces a complete export pipeline with security hardening, enhances the HTTP client with automatic retry logic and typed responses, and establishes a new CLI architecture. The package version is bumped to 0.4.2 with dependency constraints tightened and comprehensive CI testing added.

Changes

Core Library and CLI Release v0.4.2

Layer / File(s) Summary
Type System and Data Contracts
src/jike/types.py
New TypedDict models (UserInfo, PictureInfo, LinkInfo, JikePost, JikeResponse) define API response shapes. IMAGE_HOST_SUFFIXES and host_matches() enable secure image host validation for export.
Authentication Module with Hardening
src/jike/auth.py, tests/test_auth.py
Auth logic extracted with deadline-based polling, PNG header validation, error thresholds, and new --out option for token file output with 0600 permissions. Tests cover edge cases (no tokens, repeated 5xx errors, connection failures).
Client Library with Typed Responses and Auto-Retry
src/jike/client.py, tests/test_client.py
JikeClient now returns JikeResponse instead of dict, supports token-change callbacks, and automatically retries on 401 (refresh + retry) and 429 (sleep + retry). Method signatures expanded: create_post gains topic_ids and link_info; comment operations support target_type; URL quoting added for safe query construction. Tests verify retry behavior, parameter encoding, and new method signatures.
Command-Line Interface Module
src/jike/client_cli.py
New CLI module extracted from old scripts/client.py logic. Implements argparse-based dispatch for feed, posts, comments, search, profiles, notifications. Supports token env vars and link metadata construction.
Export Pipeline with Hardening
src/jike/export.py, src/jike/export_utils.py, tests/test_export.py
Introduces complete export-to-Markdown pipeline. export.py handles CLI wiring, auth, checkpoint loading/saving, and API calls with auto-refresh/retry. export_utils.py provides utilities: username validation, path traversal prevention, Markdown escaping, image downloading with content-type/size limits and hostname allow-listing, atomic writes, and timestamp formatting. Comprehensive tests validate security hardening (HTML escaping, unsafe link filtering, extension/host rejection, atomic write behavior).
CLI Entry Point Refactoring
src/jike/__main__.py, tests/test_main.py
Refactored dispatch to explicitly route auth, export, and default (client_cli) subcommands with proper argv slicing. Tests verify correct argument routing.
Scripts Converted to Thin Wrappers
scripts/auth.py, scripts/client.py, scripts/export.py
All three scripts reduced to sys.path setup and delegation to corresponding src/jike/ module main() functions.
Test Fixtures Simplified
tests/conftest.py
Token fixtures simplified to readable fake strings (fake-access-token-for-tests, etc.).

Release Configuration, Dependencies, and Documentation

Layer / File(s) Summary
Package Configuration and Dependencies
pyproject.toml
Version bumped to 0.4.2. Dependencies tightened: requests>=2.31,<3 (was >=2.28). QR optional dependency pinned: qrcode[pil]>=7.4,<9 and Pillow>=12.2,<13 (was unpinned). New dev optional group with bandit, pip-audit, pytest, ruff. Script entry point jike-api updated to use new client_cli module. Project URLs expanded. Wheel force-include added. Tool configurations (pytest, ruff, bandit) introduced.
Version and Metadata Updates
src/jike/__init__.py, .agents/plugins/marketplace.json, .claude-plugin/marketplace.json, .claude-plugin/plugin.json, .codex-plugin/plugin.json
Version set to 0.4.2 across all manifests and package metadata. Author attribution changed to "Alice and contributors". Codex plugin changed from installed-by-default to opt-in (AVAILABLE). Codex plugin pinned to ref v0.4.2 in agents marketplace.
CI Workflow with Comprehensive Checks
.github/workflows/ci.yml
New CI runs on PR and push to main/codex-/claude-. Runs Python 3.10 and 3.13 on ubuntu-latest. Installs with .[dev,qr] extras. Static checks: Ruff lint + py_compile. Security checks: Bandit + pip-audit. Runs pytest. Builds wheel and validates contents (required files present, forbidden artifacts absent).
Changelog, Security Policy, and License
CHANGELOG.md, SECURITY.md, LICENSE
New v0.4.2 release notes document export hardening, script consolidation, CLI enhancements, dependency updates. New security policy guides vulnerability reporting and token handling. License copyright updated to "Alice and contributors".
Documentation and API Reference Updates
README.md, SKILL.md, skills/jike/SKILL.md, references/api.md, .gitignore
README updated: version 0.4.2, new export/auth command docs, token hygiene section, checkpoint/resume flags, image host restrictions. SKILL.md files expanded with checkpoint examples, export operation details, and token safety guidance. API reference clarified optional fields and endpoint variants. .omx/ added to gitignore.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR introduces substantial new code (export pipeline, CLI refactoring, test suite) across multiple functional areas. While individual components are well-structured, reviewers must verify security hardening (image validation, HTML escaping, path safety), automatic retry logic correctness (401/429 handling), checkpoint resumability, and proper argument dispatch across the new module boundaries. The refactoring consolidates duplicated logic, requiring validation that the split is clean and all behaviors are preserved.

Poem

🐰 A rabbit hops through code, arranging blocks with care,
Scripts thin and light now float through autumn air,
Export's strong and safe, with checkpoints to resume,
Retries dance on 429—no more room in the room!
From scattered scripts to gathered src/, hooray—
v0.4.2 bounces bright, leading the way! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.72% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix(jike): harden export and plugin packaging' directly summarizes the main changes: security hardening of export functionality and plugin/marketplace packaging updates.
Description check ✅ Passed The PR description is well-detailed and clearly relates to the changeset, covering security fixes, convergence of scripts, manifest alignment, and providing comprehensive verification results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex-arfd-365-jike-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79528fca6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/jike/auth.py
return tokens

def _write_token_file(path: Path, tokens: TokenPair) -> None:
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject symlink targets for auth token output

Writing --out with os.open(..., O_CREAT|O_TRUNC) follows symlinks, so on a shared or attacker-controlled directory an attacker can pre-create a symlink and make jike auth --out ... overwrite an unintended file. This is a security regression in the new file-output path despite advertising secure token storage; use O_NOFOLLOW (where available) plus an atomic temp-file replace strategy in the destination directory to avoid symlink clobbering.

Useful? React with 👍 / 👎.

Comment thread src/jike/client.py
Comment on lines 59 to +62
return self._request(method, path, retry_on_401=False, **kwargs)

if resp.status_code == 429 and retry_on_429:
time.sleep(_retry_after(resp))
return self._request(method, path, retry_on_429=False, **kwargs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve both retry guards across recursive requests

The retry recursion resets the other guard: the 401 path retries without forwarding retry_on_429, and the 429 path retries without forwarding retry_on_401. If the server alternates 401 and 429 responses (for example during token expiry plus rate limiting), the function can recurse repeatedly until a RecursionError instead of failing cleanly after one retry of each class.

Useful? React with 👍 / 👎.

Comment thread src/jike/export.py
Comment on lines +45 to +48
return _api_call(method, path, at, rt, retry=False, **kwargs)
if resp.status_code == 429 and retry_429:
time.sleep(_retry_after(resp))
return _api_call(method, path, at, rt, retry_429=False, **kwargs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep retry flags when recursing in export API helper

As in the main client, _api_call drops one retry flag when handling the other status (401 retry omits retry_429, 429 retry omits retry). Under alternating 401/429 responses, this can keep re-enabling retries and recurse until stack overflow rather than terminating after bounded retries.

Useful? React with 👍 / 👎.

Comment thread src/jike/auth.py
return tokens

def _write_token_file(path: Path, tokens: TokenPair) -> None:
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce 0600 when overwriting existing token files

The mode argument on os.open(..., O_CREAT|O_TRUNC, 0o600) only applies when creating a new file, so existing --out files keep their old permissions (for example 0644). That means tokens can remain readable by other local users even though the CLI reports they were written with mode 0600. Explicitly fchmod the opened FD (or chmod the path) after opening.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

19-22: ⚡ Quick win

Security: Pin GitHub Actions to commit hashes.

Actions are not pinned to hashes, which creates a supply-chain risk if tags are moved or compromised. Pinning to commit hashes (with tag comments) is more secure.

🔒 Proposed fix
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+      - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
         with:
           python-version: ${{ matrix.python-version }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 19 - 22, Pin the workflow steps using
actions/checkout@v4 and actions/setup-python@v5 to their specific commit SHAs
instead of the floating tag; update the uses entries for actions/checkout and
actions/setup-python to the corresponding full commit hash (and keep the tag as
a comment for readability) so the CI references an immutable version.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Line 19: The checkout step using actions/checkout@v4 currently leaves
GITHUB_TOKEN persisted; update that checkout step configuration to add
persist-credentials: false so the workflow does not expose the runner's token to
subsequent steps (keep other options like fetch-depth or submodules unchanged
unless needed). Locate the step invoking actions/checkout@v4 in the CI workflow
and add the persist-credentials: false input to the step definition.

In `@scripts/auth.py`:
- Around line 6-8: The code computes SRC = Path(__file__).resolve().parents[1] /
"src" and unconditionally inserts it into sys.path; instead, first check
SRC.is_dir() and fail fast if it's missing to avoid falling back to an installed
"jike" package—e.g., if not SRC.is_dir(): print an explicit error and exit (or
raise SystemExit/RuntimeError) rather than mutating sys.path; only call
sys.path.insert(0, str(SRC)) when the directory exists, and apply the same
SRC.is_dir() guard to the other thin wrapper scripts that perform the same
import setup.

In `@src/jike/auth.py`:
- Around line 183-187: The _write_token_file function currently truncates the
user-provided path (following symlinks) and can leave partial JSON; change it to
write atomically by creating a new temporary file in path.parent (ensure the
parent directory exists), open that temp with mode 0o600, write tokens.to_dict()
JSON to it, flush and call os.fsync on the file descriptor to persist data,
close it, then atomically replace the target with os.replace(tmp_path, path);
this prevents symlink clobbering and truncated JSON while keeping TokenPair
serialization unchanged.

In `@src/jike/client_cli.py`:
- Around line 97-102: _warn_process_visible_tokens currently only checks for
"--access-token" and "--refresh-token" exact entries and misses equals-style
flags; update the check in _warn_process_visible_tokens to detect either the
flag alone or any argument that starts with "--access-token=" or
"--refresh-token=" (e.g., use membership OR startswith checks against argv
entries) so the warning is printed for both space-separated and equals-style
token usages.

In `@src/jike/client.py`:
- Around line 57-63: The recursive calls in _request drop the other retry flag;
when handling a 401 you must call _request with retry_on_401=False while
preserving retry_on_429, and when handling a 429 call _request with
retry_on_429=False while preserving retry_on_401, so update the two recursive
invocations to pass both retry_on_401 and retry_on_429 (only flipping the flag
you consumed) so the caller's preferences are not silently changed.

In `@src/jike/export_utils.py`:
- Around line 162-165: The atomic_write function can fail when the target's
parent directories don't exist; before creating the temporary file (tmp) ensure
its parent directory exists by calling Path.mkdir with parents=True and
exist_ok=True on tmp.parent (or path.parent) so the directory tree is created
before tmp.write_text and os.replace are called; keep the rest of atomic_write
(tmp name and os.replace) unchanged.

In `@src/jike/export.py`:
- Around line 115-117: _warn_process_visible_tokens currently only detects
separate flag tokens like "--access-token" or "--refresh-token" and misses
equals-style forms like "--access-token=..."; update the check to detect both
forms by iterating argv and treating an arg as a match if it equals the flag or
startswith the flag + "=" (e.g., any(a == "--access-token" or
a.startswith("--access-token=") or a == "--refresh-token" or
a.startswith("--refresh-token=") for a in argv)), then print the same stderr
warning when any match is found.
- Around line 86-98: The pagination loop in export_user posts can hang if the
API returns a truthy load_more_key but the cursor doesn't advance; modify the
loop in the block using fetch_user_posts to detect non-advancing cursors by
tracking the previous load_more_key (and optionally previous (at, rt) or total
posts count) and aborting after a small number of repeated, unchanged cursors
(or when returned batch length is zero despite a truthy load_more_key); when
aborting, persist the checkpoint via _save_checkpoint and return the collected
posts, at, rt; reference variables/functions: fetch_user_posts, load_more_key,
_save_checkpoint, posts, page, RATE_LIMIT_DELAY.

---

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 19-22: Pin the workflow steps using actions/checkout@v4 and
actions/setup-python@v5 to their specific commit SHAs instead of the floating
tag; update the uses entries for actions/checkout and actions/setup-python to
the corresponding full commit hash (and keep the tag as a comment for
readability) so the CI references an immutable version.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b8ca556b-be9a-4c97-a3bd-3572eb7f7106

📥 Commits

Reviewing files that changed from the base of the PR and between 73f4d8a and 79528fc.

📒 Files selected for processing (30)
  • .agents/plugins/marketplace.json
  • .claude-plugin/marketplace.json
  • .claude-plugin/plugin.json
  • .codex-plugin/plugin.json
  • .github/workflows/ci.yml
  • .gitignore
  • CHANGELOG.md
  • LICENSE
  • README.md
  • SECURITY.md
  • SKILL.md
  • pyproject.toml
  • references/api.md
  • scripts/auth.py
  • scripts/client.py
  • scripts/export.py
  • skills/jike/SKILL.md
  • src/jike/__init__.py
  • src/jike/__main__.py
  • src/jike/auth.py
  • src/jike/client.py
  • src/jike/client_cli.py
  • src/jike/export.py
  • src/jike/export_utils.py
  • src/jike/types.py
  • tests/conftest.py
  • tests/test_auth.py
  • tests/test_client.py
  • tests/test_export.py
  • tests/test_main.py

Comment thread .github/workflows/ci.yml
matrix:
python-version: ["3.10", "3.13"]
steps:
- uses: actions/checkout@v4
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Security: Set persist-credentials: false to prevent token exposure.

The workflow does not set persist-credentials: false, leaving the GITHUB_TOKEN available in subsequent steps. If any step writes files or uploads artifacts, the token could be exposed.

🔒 Proposed fix
       - uses: actions/checkout@v4
+        with:
+          persist-credentials: false
       - uses: actions/setup-python@v5
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- uses: actions/checkout@v4
- uses: actions/checkout@v4
with:
persist-credentials: false
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 19-19: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 19-19: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml at line 19, The checkout step using
actions/checkout@v4 currently leaves GITHUB_TOKEN persisted; update that
checkout step configuration to add persist-credentials: false so the workflow
does not expose the runner's token to subsequent steps (keep other options like
fetch-depth or submodules unchanged unless needed). Locate the step invoking
actions/checkout@v4 in the CI workflow and add the persist-credentials: false
input to the step definition.

Comment thread scripts/auth.py
Comment on lines +6 to +8
SRC = Path(__file__).resolve().parents[1] / "src"
if str(SRC) not in sys.path:
sys.path.insert(0, str(SRC))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when the bundled src/ tree is missing.

If scripts/../src is absent here, the later import jike... will fall through to any other jike package already on sys.path, which can silently run a stale installed version instead of this checkout/plugin bundle. Please guard SRC.is_dir() before mutating sys.path, and mirror the same check in the other thin wrappers in this PR.

Proposed fix
 SRC = Path(__file__).resolve().parents[1] / "src"
+if not SRC.is_dir():
+    raise RuntimeError(f"Expected bundled src directory at {SRC}")
 if str(SRC) not in sys.path:
     sys.path.insert(0, str(SRC))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SRC = Path(__file__).resolve().parents[1] / "src"
if str(SRC) not in sys.path:
sys.path.insert(0, str(SRC))
SRC = Path(__file__).resolve().parents[1] / "src"
if not SRC.is_dir():
raise RuntimeError(f"Expected bundled src directory at {SRC}")
if str(SRC) not in sys.path:
sys.path.insert(0, str(SRC))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/auth.py` around lines 6 - 8, The code computes SRC =
Path(__file__).resolve().parents[1] / "src" and unconditionally inserts it into
sys.path; instead, first check SRC.is_dir() and fail fast if it's missing to
avoid falling back to an installed "jike" package—e.g., if not SRC.is_dir():
print an explicit error and exit (or raise SystemExit/RuntimeError) rather than
mutating sys.path; only call sys.path.insert(0, str(SRC)) when the directory
exists, and apply the same SRC.is_dir() guard to the other thin wrapper scripts
that perform the same import setup.

Comment thread src/jike/auth.py
Comment on lines +183 to +187
def _write_token_file(path: Path, tokens: TokenPair) -> None:
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w", encoding="utf-8") as f:
json.dump(tokens.to_dict(), f, indent=2)
f.write("\n")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and inspect the relevant function and surrounding context
sed -n '140,230p' src/jike/auth.py

# Show imports to see whether tempfile/os.replace/etc are already used elsewhere in the module
sed -n '1,120p' src/jike/auth.py

# Search for other token file write logic and for where the destination path is derived
rg -n "_write_token_file|token_file|out\b|--out|write_token|replace\(|O_TRUNC|fdopen\(" src/jike/auth.py

Repository: MidnightDarling/jike-skill

Length of output: 8425


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check python version constraints
ls -la
fd -a -t f 'pyproject.toml|setup.cfg|setup.py|requirements*.txt|Pipfile|tox.ini' . | sed -n '1,200p' || true

if [ -f pyproject.toml ]; then
  rg -n "requires-python|python_requires" pyproject.toml || true
fi
if [ -f setup.cfg ]; then
  rg -n "python_requires" setup.cfg || true
fi
if [ -f setup.py ]; then
  rg -n "python_requires" setup.py || true
fi
if [ -f tox.ini ]; then
  rg -n "envlist|basepython" tox.ini || true
fi

# Also sanity-check whether missing_ok is used elsewhere
rg -n "missing_ok=True" -S src . || true

Repository: MidnightDarling/jike-skill

Length of output: 2536


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,260p' tests/test_auth.py
rg -n "out|--out|_write_token_file|token file|symlink|O_NOFOLLOW|replace|atomic" tests/test_auth.py src/jike/auth.py

Repository: MidnightDarling/jike-skill

Length of output: 10604


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "_write_token_file|--out|Tokens written to|mode 0600|args\.out|O_TRUNC|replace\(" tests/test_auth.py

# If any matches exist, print the surrounding sections
for n in $(rg -n "_write_token_file|--out|Tokens written to|mode 0600|args\.out" tests/test_auth.py | cut -d: -f1 | sort -u); do
  echo "---- around line $n ----"
  sed -n "$((n-25)),$((n+80))p" tests/test_auth.py
done

Repository: MidnightDarling/jike-skill

Length of output: 52


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "args\.out|--out|_write_token_file|Tokens written" tests/test_auth.py || true
rg -n "symlink|O_NOFOLLOW|os\.replace|atomic" tests/test_auth.py || true

Repository: MidnightDarling/jike-skill

Length of output: 52


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== matches for --out / _write_token_file =="
rg -n "args\.out|--out|_write_token_file|Tokens written" tests/test_auth.py

echo "== matches for symlink / replace / atomic in tests =="
rg -n "symlink|O_NOFOLLOW|os\.replace|atomic" tests/test_auth.py || true

Repository: MidnightDarling/jike-skill

Length of output: 114


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,220p' pyproject.toml
rg -n "classifiers|Programming Language|Operating System|python_requires|platform" pyproject.toml || true

Repository: MidnightDarling/jike-skill

Length of output: 2653


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "Windows|win32|os\.name|POSIX|posix|nt\." README.md SECURITY.md SKILL.md pyproject.toml .github -S || true
ls -la .github || true
fd -t f .github -a || true

Repository: MidnightDarling/jike-skill

Length of output: 1725


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '140,230p' SKILL.md
sed -n '170,230p' README.md
sed -n '1,120p' .github/workflows/ci.yml

Repository: MidnightDarling/jike-skill

Length of output: 6759


Harden token --out writes against symlink clobbering and partial/truncated JSON.

  • src/jike/auth.py:184 truncates the destination with os.open(... O_TRUNC ...) on the user-provided path, which follows symlinks and can leave partial JSON if the process is interrupted.
  • Write to a temp file in path.parent (mode 0600), flush + fsync, then atomically commit with os.replace(tmp_path, path) (and ensure the parent dir exists) to avoid both symlink-target clobbering and incomplete writes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jike/auth.py` around lines 183 - 187, The _write_token_file function
currently truncates the user-provided path (following symlinks) and can leave
partial JSON; change it to write atomically by creating a new temporary file in
path.parent (ensure the parent directory exists), open that temp with mode
0o600, write tokens.to_dict() JSON to it, flush and call os.fsync on the file
descriptor to persist data, close it, then atomically replace the target with
os.replace(tmp_path, path); this prevents symlink clobbering and truncated JSON
while keeping TokenPair serialization unchanged.

Comment thread src/jike/client_cli.py
Comment on lines +97 to +102
def _warn_process_visible_tokens(argv: list[str]) -> None:
if "--access-token" in argv or "--refresh-token" in argv:
print(
"[!] Token flags may be visible to other local users; prefer env vars.",
file=sys.stderr,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Token warning misses --access-token=... / --refresh-token=... forms.

Users passing equals-style token flags won’t see the process-visibility warning.

Suggested fix
 def _warn_process_visible_tokens(argv: list[str]) -> None:
-    if "--access-token" in argv or "--refresh-token" in argv:
+    if any(
+        arg == "--access-token"
+        or arg.startswith("--access-token=")
+        or arg == "--refresh-token"
+        or arg.startswith("--refresh-token=")
+        for arg in argv
+    ):
         print(
             "[!] Token flags may be visible to other local users; prefer env vars.",
             file=sys.stderr,
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _warn_process_visible_tokens(argv: list[str]) -> None:
if "--access-token" in argv or "--refresh-token" in argv:
print(
"[!] Token flags may be visible to other local users; prefer env vars.",
file=sys.stderr,
)
def _warn_process_visible_tokens(argv: list[str]) -> None:
if any(
arg == "--access-token"
or arg.startswith("--access-token=")
or arg == "--refresh-token"
or arg.startswith("--refresh-token=")
for arg in argv
):
print(
"[!] Token flags may be visible to other local users; prefer env vars.",
file=sys.stderr,
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jike/client_cli.py` around lines 97 - 102, _warn_process_visible_tokens
currently only checks for "--access-token" and "--refresh-token" exact entries
and misses equals-style flags; update the check in _warn_process_visible_tokens
to detect either the flag alone or any argument that starts with
"--access-token=" or "--refresh-token=" (e.g., use membership OR startswith
checks against argv entries) so the warning is printed for both space-separated
and equals-style token usages.

Comment thread src/jike/client.py
Comment on lines 57 to 63
if resp.status_code == 401 and retry_on_401:
self._refresh()
self._refresh(previous_access_token=self._tokens.access_token)
return self._request(method, path, retry_on_401=False, **kwargs)

if resp.status_code == 429 and retry_on_429:
time.sleep(_retry_after(resp))
return self._request(method, path, retry_on_429=False, **kwargs)
resp.raise_for_status()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve retry flags across recursive retries.

Line 59 and Line 62 re-enter _request without carrying the other retry flag, which can silently re-enable a retry mode the caller disabled.

Suggested patch
         if resp.status_code == 401 and retry_on_401:
             self._refresh(previous_access_token=self._tokens.access_token)
-            return self._request(method, path, retry_on_401=False, **kwargs)
+            return self._request(
+                method,
+                path,
+                retry_on_401=False,
+                retry_on_429=retry_on_429,
+                **kwargs,
+            )
         if resp.status_code == 429 and retry_on_429:
             time.sleep(_retry_after(resp))
-            return self._request(method, path, retry_on_429=False, **kwargs)
+            return self._request(
+                method,
+                path,
+                retry_on_401=retry_on_401,
+                retry_on_429=False,
+                **kwargs,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if resp.status_code == 401 and retry_on_401:
self._refresh()
self._refresh(previous_access_token=self._tokens.access_token)
return self._request(method, path, retry_on_401=False, **kwargs)
if resp.status_code == 429 and retry_on_429:
time.sleep(_retry_after(resp))
return self._request(method, path, retry_on_429=False, **kwargs)
resp.raise_for_status()
if resp.status_code == 401 and retry_on_401:
self._refresh(previous_access_token=self._tokens.access_token)
return self._request(
method,
path,
retry_on_401=False,
retry_on_429=retry_on_429,
**kwargs,
)
if resp.status_code == 429 and retry_on_429:
time.sleep(_retry_after(resp))
return self._request(
method,
path,
retry_on_401=retry_on_401,
retry_on_429=False,
**kwargs,
)
resp.raise_for_status()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jike/client.py` around lines 57 - 63, The recursive calls in _request
drop the other retry flag; when handling a 401 you must call _request with
retry_on_401=False while preserving retry_on_429, and when handling a 429 call
_request with retry_on_429=False while preserving retry_on_401, so update the
two recursive invocations to pass both retry_on_401 and retry_on_429 (only
flipping the flag you consumed) so the caller's preferences are not silently
changed.

Comment thread src/jike/export_utils.py
Comment on lines +162 to +165
def atomic_write(path: Path, content: str) -> None:
tmp = path.with_suffix(path.suffix + ".partial")
tmp.write_text(content, encoding="utf-8")
os.replace(tmp, path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ensure parent directories exist before atomic writes.

atomic_write can fail for valid nested output paths because the parent directory may not exist.

Suggested fix
 def atomic_write(path: Path, content: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
     tmp = path.with_suffix(path.suffix + ".partial")
     tmp.write_text(content, encoding="utf-8")
     os.replace(tmp, path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jike/export_utils.py` around lines 162 - 165, The atomic_write function
can fail when the target's parent directories don't exist; before creating the
temporary file (tmp) ensure its parent directory exists by calling Path.mkdir
with parents=True and exist_ok=True on tmp.parent (or path.parent) so the
directory tree is created before tmp.write_text and os.replace are called; keep
the rest of atomic_write (tmp name and os.replace) unchanged.

Comment thread src/jike/export.py
Comment on lines +86 to +98
while True:
page += 1
print(f" Fetching page {page}...", file=sys.stderr, end="", flush=True)
data, at, rt = fetch_user_posts(username, at, rt, load_more_key=load_more_key)
batch = data.get("data", [])
posts.extend(batch)
print(f" got {len(batch)} posts (total: {len(posts)})", file=sys.stderr)
load_more_key = data.get("loadMoreKey")
if checkpoint_path:
_save_checkpoint(checkpoint_path, username, posts, load_more_key)
if not load_more_key or not batch:
return posts, at, rt
time.sleep(RATE_LIMIT_DELAY)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard pagination against non-advancing cursors to prevent infinite loops.

If the API keeps returning a truthy loadMoreKey with non-empty data but doesn’t advance the cursor, this loop never terminates.

Suggested fix
 def fetch_all_posts(
@@
 ) -> tuple[list[dict], str, str]:
     posts, load_more_key = _load_checkpoint(checkpoint_path, username) if resume and checkpoint_path else ([], None)
     if resume and posts and load_more_key is None:
         return posts, at, rt
     page = 0
+    seen_cursors: set[str] = set()
     while True:
         page += 1
         print(f"  Fetching page {page}...", file=sys.stderr, end="", flush=True)
         data, at, rt = fetch_user_posts(username, at, rt, load_more_key=load_more_key)
         batch = data.get("data", [])
         posts.extend(batch)
         print(f" got {len(batch)} posts (total: {len(posts)})", file=sys.stderr)
         load_more_key = data.get("loadMoreKey")
+        if load_more_key is not None:
+            cursor_key = json.dumps(load_more_key, sort_keys=True, ensure_ascii=False)
+            if cursor_key in seen_cursors:
+                raise ValueError("pagination cursor did not advance")
+            seen_cursors.add(cursor_key)
         if checkpoint_path:
             _save_checkpoint(checkpoint_path, username, posts, load_more_key)
         if not load_more_key or not batch:
             return posts, at, rt
         time.sleep(RATE_LIMIT_DELAY)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jike/export.py` around lines 86 - 98, The pagination loop in export_user
posts can hang if the API returns a truthy load_more_key but the cursor doesn't
advance; modify the loop in the block using fetch_user_posts to detect
non-advancing cursors by tracking the previous load_more_key (and optionally
previous (at, rt) or total posts count) and aborting after a small number of
repeated, unchanged cursors (or when returned batch length is zero despite a
truthy load_more_key); when aborting, persist the checkpoint via
_save_checkpoint and return the collected posts, at, rt; reference
variables/functions: fetch_user_posts, load_more_key, _save_checkpoint, posts,
page, RATE_LIMIT_DELAY.

Comment thread src/jike/export.py
Comment on lines +115 to +117
def _warn_process_visible_tokens(argv: list[str]) -> None:
if "--access-token" in argv or "--refresh-token" in argv:
print("[!] Token flags may be visible to other local users; prefer env vars.", file=sys.stderr)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Token warning misses --access-token=... / --refresh-token=... forms.

Current matching only catches split-arg style and silently skips warning for equals-style flags.

Suggested fix
 def _warn_process_visible_tokens(argv: list[str]) -> None:
-    if "--access-token" in argv or "--refresh-token" in argv:
+    if any(
+        arg == "--access-token"
+        or arg.startswith("--access-token=")
+        or arg == "--refresh-token"
+        or arg.startswith("--refresh-token=")
+        for arg in argv
+    ):
         print("[!] Token flags may be visible to other local users; prefer env vars.", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jike/export.py` around lines 115 - 117, _warn_process_visible_tokens
currently only detects separate flag tokens like "--access-token" or
"--refresh-token" and misses equals-style forms like "--access-token=...";
update the check to detect both forms by iterating argv and treating an arg as a
match if it equals the flag or startswith the flag + "=" (e.g., any(a ==
"--access-token" or a.startswith("--access-token=") or a == "--refresh-token" or
a.startswith("--refresh-token=") for a in argv)), then print the same stderr
warning when any match is found.

Copy link
Copy Markdown
Owner Author

@MidnightDarling MidnightDarling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/code-review --effort max 审查报告

5 angles × 8 候选并行扫描 → verifier 验证 → 严重度排序 → Top 15 已挂到对应代码行(见上方 inline 评论)。

严重度分布

级别 数量 涉及
安全 · 严重 3 SSRF 旁路(图片下载未禁止重定向)、Content-Type 头缺失绕过、os.open 0o600 mode 对已存在文件无效
安全/隐私 3 内联图 URL 无 host 白名单、images_dir 父目录 symlink TOCTOU、--access-token=VALUE 等号形式漏告警
可用性回归 3 validate_username 拒绝 CJK / 含点用户名、转发块条件收窄丢失数据、图片下载失败不再 fallback URL
逻辑 Bug 3 401/429 重试递归丢失对方开关、json_path_for 多点后缀覆盖、fetch_all_posts 无上限
稳健性 3 _load_checkpoint 非 dict 触发 AttributeError、丢失 KeyboardInterrupt 处理、refresh_tokens_extract_tokens 契约不一致

必须修复(合并前)

  1. 图片下载 allow_redirects=False(评论 #1
  2. Content-Type 默认拒绝(评论 #2
  3. _write_token_file 显式 fchmod(0o600) 或 tmp+replace(评论 #3
  4. 内联图 URL 走 host 白名单 / 警示降级(评论 #4)
  5. --access-token=VALUE 等号形式纳入告警判定(评论 #6)

建议修复(同 PR 内)

6/7/8/9/10/11/12/13/14/15 —— 详见对应 inline 评论。

测试缺口

  • tests/test_export.py:27 把 "notes.md.backupnotes.md.json" 这个 footgun 钉死成了 API 契约 —— 应在修复 json_path_for 时同步移除/翻转该断言。
  • 无任何并发写入测试覆盖 atomic_writedownload_image.partial 竞争(次级风险,未进 Top 15)。
  • --access-token=VALUE 等号形式告警的负面测试。

文档/配置层附带建议(未进 Top 15)

  • .github/workflows/ci.yml 的 wheel 内容校验列表里没有 jike_skill/SKILL.md,但 pyproject.toml force-include 了它 —— CI 假阴性窗口。
  • SKILL.md:123 Operations 表把 export 列在和 feed/post/search 同一组里,可能让用户误以为 python3 scripts/client.py export ... 也能跑(实际只走 __main__.py 路径)。
  • README.md 命令表只文档化了 jike-export,对 jike-api / jike-auth 这两个同样在 pyproject.toml 注册的 entry point 没有记录。

由于 PR 作者与审查账号同一,本次以 COMMENT 提交(不允许 self-REQUEST_CHANGES),实质等同请求修改。


Generated by Claude Code

Comment thread src/jike/export_utils.py
raise ValueError("images directory must not be a symlink")
if filepath.exists():
return _relative(filepath, output_dir)
with requests.get(safe, timeout=30, stream=True) as resp:
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[安全 · 严重] SSRF / 域名白名单旁路:未禁止重定向

requests.get(safe, timeout=30, stream=True) 未设置 allow_redirects=Falsehost_matches 只校验初始 URL 的主机,但 requests 默认会跟随 30x 跳转。
攻击场景:白名单内的 CDN(如 cdn.ruguoapp.com)或任意被攻击者影响的资源若返回 302 Location: http://169.254.169.254/latest/meta-data/http://127.0.0.1:6379/...,下载会跟随跳到任意内网/云元数据地址,并将其响应作为 "图片" 落盘为 post_NNNN_orig_N.png。允许列表给了一种虚假的安全感。

建议:

with requests.get(safe, timeout=30, stream=True, allow_redirects=False) as resp:
    if resp.is_redirect or resp.is_permanent_redirect:
        raise ValueError("image host returned a redirect")

或在跟随前手动验证每跳的 host_matches。


Generated by Claude Code

Comment thread src/jike/export_utils.py
resp.raise_for_status()
ctype = resp.headers.get("Content-Type", "").split(";", 1)[0].lower()
size = int(resp.headers.get("Content-Length", "0") or 0)
if ctype and ctype not in IMAGE_TYPES:
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[安全 · 严重] Content-Type 白名单可被「无头」响应绕过

if ctype and ctype not in IMAGE_TYPES —— 当响应缺失 Content-Type 头时,ctype 为空字符串,and ctype 短路成 False,整个 Content-Type 类型检查被跳过。攻击者只要让 CDN 边缘节点(或同源缓存)回一个不带 Content-Type 的 200 即可绕过类型校验,把任意小于 20 MB 的 HTML / 可执行文件 / shell 脚本落盘为 .png/.jpg,再被 Markdown 渲染器或下游消费者拉起。

修复(应"默认拒绝"):

ctype = resp.headers.get("Content-Type", "").split(";", 1)[0].strip().lower()
if ctype not in IMAGE_TYPES:
    raise ValueError("unexpected image content type")

Generated by Claude Code

Comment thread src/jike/auth.py
def _write_token_file(path: Path, tokens: TokenPair) -> None:
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w", encoding="utf-8") as f:
json.dump(tokens.to_dict(), f, indent=2)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[安全 · 严重] os.open 的 mode 参数对已存在文件无效 —— 0o600 承诺被虚假宣告

POSIX 语义:os.open(path, O_WRONLY|O_CREAT|O_TRUNC, 0o600) 仅在首次创建时应用 mode;若 tokens.json 已经存在(任意一次 re-run,或先前 touch tokens.json 留下的 0o644),mode 被静默忽略,但 auth.main() 仍然打印 [+] Tokens written to {args.out} with mode 0600 给用户错觉。Jike refresh_token 是长期凭据,世界可读的影响和 README 中"--out creates 0600"的保证完全不符 —— SECURITY.md 里也明确承诺了这个属性。

修复:

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
try:
    os.fchmod(fd, 0o600)   # tighten regardless of pre-existing bits
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        ...
except BaseException:
    os.close(fd); raise

或者写到临时文件 → os.chmod(0o600)os.replace(tmp, path),使最终 inode 一定是新创建的。


Generated by Claude Code

Comment thread src/jike/export_utils.py

def image_line(url: str, images: bool, images_dir: Optional[Path], output_dir: Path, index: int, key: str) -> Optional[str]:
src = download_image(url, images_dir, output_dir, index, key) if images and images_dir else url
return f"![img]({md_url(src)})" if src else None
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[安全/隐私] 当未 --download-images 时,内联到 Markdown 的 URL 未做 host 白名单

image_line(url, images=False, ...) 直接返回 [img]({md_url(url)})safe_url 只校验 http(s)+netloc校验主机。IMAGE_HOST_SUFFIXES 的白名单只在 download_image 分支里生效。

攻击场景:攻击者发一条即刻动态,pictures[0].picUrl 指向 https://attacker.example.com/track.png?u=victim。导出器把这条原样写入受害者本地 Markdown。受害者一旦用支持远程加载的 Markdown 渲染器(VS Code preview / Obsidian / 任意网页转换工具)打开导出文件,浏览器/插件就会替攻击者发起请求,泄露 IP/Referrer/UA,并构成持久化的"邮件追踪像素"等同物。

建议:对内联 URL 也走 host_matches(... , _image_host_suffixes()),命中失败时降级为纯文本占位符或保留原 URL 但用代码块标注不可信。


Generated by Claude Code

Comment thread src/jike/export_utils.py
try:
images_dir.mkdir(parents=True, exist_ok=True)
if images_dir.is_symlink():
raise ValueError("images directory must not be a symlink")
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[安全] images_dir TOCTOU:mkdir(parents=True, exist_ok=True) 会跟随父目录里的已存在 symlink

images_dir.is_symlink() 只检查叶节点;mkdir(parents=True, exist_ok=True)穿过任何已存在的中间符号链接组件,然后在被指向的目录里创建叶。叶本身不是 symlink,is_symlink() 返回 False,校验通过,但 tmp.open('wb') / os.replace 实际写到了攻击者控制的位置。

经验复现:

os.symlink("/tmp/attacker_dir", "alice_images")
Path("alice_images/sub").mkdir(parents=True, exist_ok=True)
# Path("alice_images/sub").is_symlink() → False
# 但 realpath 解析到 /tmp/attacker_dir/sub

修复(任一):

  • 收紧到 images_dir.mkdir(parents=False, exist_ok=True),并对解析后的 images_dir.resolve() 做白名单校验;
  • 或用 os.open(..., O_NOFOLLOW) + openat() 风格写入;
  • 或要求 images_dir 必须在 output_dir.resolve() 之下(前缀校验)。

Generated by Claude Code

Comment thread src/jike/export_utils.py
def json_path_for(output_path: str, username: str) -> Path:
if output_path == "-":
return Path(f"{validate_username(username)}_jike_export.json")
return Path(output_path).with_suffix(".json")
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug · 中] Path.with_suffix(".json") 只替换最末一段后缀 —— 多点文件名会被悄悄改写

Path("notes.md.backup").with_suffix(".json")Path("notes.md.json")(覆盖一个早就可能存在的 Markdown 导出 sidecar 同名 .json)。同样地:

  • Path("archive.tar.gz").with_suffix(".json")archive.tar.json(语义上的 .gz 信息丢了)
  • Path(".posts").with_suffix(".json")posts.json(前缀的 . 也被吃掉)

更糟糕的是 tests/test_export.py:27 把"notes.md.backupnotes.md.json"作为预期行为写进了断言,这等于把这个 footgun 钉死成了 API 契约。

建议:

def json_path_for(output_path, username):
    if output_path == "-":
        return Path(f"{validate_username(username)}_jike_export.json")
    p = Path(output_path)
    return p.with_name(p.name + ".json")   # 始终追加,不删除原后缀

并更新对应测试。


Generated by Claude Code

Comment thread src/jike/export.py
if resume and posts and load_more_key is None:
return posts, at, rt
page = 0
while True:
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[DoS · 中] fetch_all_posts 缺少最大页数 / 最大条数上限,可被恶意或异常分页打成 OOM + 磁盘满

while True: ... if not load_more_key or not batch: return —— 唯一的退出条件是服务器自愿把 loadMoreKey 设为 falsy 或返回空批。一个返回同一个 batch + 同一个非空 loadMoreKey 的不良/敌对服务端会让循环以每 0.5 秒一页的速度无限扩张 posts,同时不断重写 checkpoint 文件,直到进程被 OOM kill 或磁盘耗尽。

虽然这条 bug 在主分支 scripts/export.py 中已存在(即不是本 PR 新引入),但本 PR 是"加固导出"的窗口,正适合一并修掉:

MAX_PAGES = 5000
MAX_POSTS = 200_000
seen_keys: set[str] = set()
...
if page > MAX_PAGES or len(posts) > MAX_POSTS:
    raise ValueError("export aborted: pagination limit reached")
key_hash = json.dumps(load_more_key, sort_keys=True)
if key_hash in seen_keys:
    raise ValueError("export aborted: repeated loadMoreKey detected")
seen_keys.add(key_hash)

Generated by Claude Code

Comment thread src/jike/export.py
data = json.loads(path.read_text(encoding="utf-8"))
except (OSError, json.JSONDecodeError) as exc:
raise ValueError("invalid checkpoint file") from exc
if data.get("username") != username or not isinstance(data.get("posts"), list):
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug · 中] _load_checkpoint 在 checkpoint 顶层不是 dict 时抛 AttributeError,逃出现有 except

data = json.loads(path.read_text(encoding="utf-8"))
if data.get("username") != username or not isinstance(data.get("posts"), list):

当 checkpoint 文件顶层是列表([...])或纯量(数字 / 字符串)时,data.get(...) 不存在,触发 AttributeError_load_checkpoint 自己只 catch (OSError, json.JSONDecodeError)export.main 只 catch (requests.RequestException, ValueError) —— AttributeError 一路泄漏,CLI 终端不是 {"error": ...} 的统一格式,而是原始 traceback,违反了"对外只暴露 error class"的安全意图。

修复:

try:
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("invalid checkpoint format")
except (OSError, json.JSONDecodeError) as exc:
    raise ValueError("invalid checkpoint file") from exc

Generated by Claude Code

Comment thread src/jike/export.py
sys.exit(1)
except ValueError as exc:
print(json.dumps({"error": exc.__class__.__name__}), file=sys.stderr)
sys.exit(1)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[UX 回归] 丢失了 KeyboardInterrupt → exit 130 的清洁处理

主分支 scripts/export.py:419-421

except KeyboardInterrupt:
    print("\nExport interrupted.", file=sys.stderr)
    sys.exit(130)

新版 src/jike/export.py:main 只 catch RequestException / ValueError。用户在长时间分页拉取(通常几十秒到几分钟)中 Ctrl-C 时,会得到原始 Python traceback 而不是干净退出码 130。CI、调度脚本、shell trap 等下游消费者依赖此契约。

建议在最外层加:

except KeyboardInterrupt:
    print(json.dumps({"error": "interrupted"}), file=sys.stderr)
    sys.exit(130)

Generated by Claude Code

Comment thread src/jike/auth.py
"x-jike-refresh-token", token_pair.refresh_token
),
access_token=resp.headers.get("x-jike-access-token", token_pair.access_token),
refresh_token=resp.headers.get("x-jike-refresh-token", token_pair.refresh_token),
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[一致性 · 低] refresh_tokens 只读 headers,与 _extract_tokens 兼顾 body+headers 的契约不一致

本 PR 加固了 _extract_tokens —— 支持从 JSON body 的 x-jike-access-token / access_token 字段或响应头中任一处取到 token。但 refresh_tokens 仍然只读响应头:

return TokenPair(
    access_token=resp.headers.get("x-jike-access-token", token_pair.access_token),
    refresh_token=resp.headers.get("x-jike-refresh-token", token_pair.refresh_token),
)

后果:若 Jike 服务端任何时候改成把轮换后的 token 放在 body 里(这正是 _extract_tokens 已经在准备应对的情形),refresh_tokens静默继续使用旧 token,下一次请求就会 401,并且循环触发 refresh,构成"刷不动"的死局。

建议:把 token 提取逻辑统一到 _extract_tokens(resp)refresh_tokens 也复用之,失败时返回旧 token + 警告。


Generated by Claude Code

Copy link
Copy Markdown
Owner Author

/security-review 报告

方法论:从 /code-review --effort max 收集到 4 个安全候选 → 并行 false-positive filter 子任务 → 仅置信度 ≥ 8 的进入最终报告。


Vuln 1: SSRF(host 白名单可通过 HTTP 重定向旁路)— src/jike/export_utils.py:95

  • Severity: HIGH
  • Category: ssrf
  • Confidence: 8/10

Description
download_image 校验 URL host 必须命中 IMAGE_HOST_SUFFIXESokjike.com / ruguoapp.com / jellow.site,以及 JIKE_EXPORT_IMAGE_HOSTS 环境变量追加项),但随即调用 requests.get(safe, timeout=30, stream=True),未传 allow_redirects=Falserequests 默认会无声跟随 30x 跳转到任意 host。Host 白名单只对初始 URL 生效 —— PR 自述的 "image host/type/size validation" 在重定向链上并未真正实现。

Exploit Scenario

  1. 攻击者具备以下任一前提:
    • 在白名单 CDN 上有可被其影响的资源(账号上传后由 CDN 缓存 → 后续触发回源 302);
    • *.okjike.com 任一接入层存在开放重定向;
    • 白名单 CDN 的边缘节点在缓存淘汰 / 回源穿透时返回 30x。
  2. 重定向目标由攻击者选择,例如:
    • http://169.254.169.254/latest/meta-data/iam/security-credentials/(AWS IMDS → IAM 凭据)
    • http://metadata.google.internal/...(GCP 元数据服务)
    • http://internal-redis:6379/http://internal-elasticsearch:9200/...
    • http://attacker-controlled.example.com/anything
  3. requests 跟随跳转后,响应体被流式写入 images_dir/post_NNNN_orig_N.jpgContent-Length / Content-Type 在终态由攻击者控制,且 Content-Type 缺失时还有伴生绕过(见 /code-review 评论)。
  4. 内网云元数据凭据 / 内网服务响应 / 任意攻击者选定字节落到受害者磁盘,并被引用进导出的 Markdown。

这属于控 host + 控 protocol 的 SSRF(不是只控 path 的低危变种),影响超出"纯 SSRF" —— 因为响应体还会持久化到本地磁盘的受害者选定路径,引入二次利用面。

Recommendation

with requests.get(safe, timeout=30, stream=True, allow_redirects=False) as resp:
    if resp.is_redirect or resp.is_permanent_redirect:
        raise ValueError("image host returned a redirect")
    resp.raise_for_status()
    ...

或者手动解析 Location,对每一跳重新跑 host_matches 后再 GET。同样的硬化要应用于本仓任何未来加入的外部 HTTP fetch。


Filtered Findings(置信度 < 8 — 未纳入本报告,仅在 /code-review 中作为加固建议)

候选 文件:行 Filter 评级 落选原因
Content-Type 白名单可被缺失头绕过 export_utils.py:99 conf 4 / DROP 单独影响仅是"图片扩展名的字节落盘";要造成实际伤害需要与 SSRF 组合(已单列)。
os.open(..., 0o600) mode 对已存在文件无效 auth.py:184 conf 7 / DROP 文档承诺 vs 实现的差异真实存在,但触发条件需要预先有较宽松权限的同名文件;happy-path 安全。
images_dir.mkdir 的父目录 symlink TOCTOU export_utils.py:90 conf 4 / DROP 攻击者需要对受害者 CWD 或 --images-dir 父目录已有写权 —— 在那种前提下有更直接的攻击向量;未跨越权限边界。

以上三项仍然值得在本 PR 中一并修掉(属于"应硬化的次级问题"),详见 /code-review 在对应行上的 inline 评论。


总结:/security-review 维度下,1 个 HIGH 必修(重定向 SSRF),3 个建议级硬化。结合 /code-review 的 15 条 inline 评论一起处理。


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants