[Backend] Add content-based file validation (magic number checks)#667
[Backend] Add content-based file validation (magic number checks)#667SAI-HARISH2007 wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughAdded filename sanitization, absolute-path resolution to prevent directory traversal, binary-format signature checks for PDF/DOCX, UTF-8 validation for TXT, guaranteed file cleanup, and expanded upload route error handling with specific ValueError -> 400 and generic -> 500 responses. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Server as Server (/upload)
participant Processor as FileProcessor
participant FS as Filesystem
Client->>Server: POST /upload (multipart file)
Server->>Processor: process_file(file)
Processor->>Processor: secure_filename -> resolve absolute target path
Processor->>FS: save temp file
Processor->>Processor: read header bytes, detect signature
alt valid PDF/DOCX/TXT
Processor->>Processor: full validation (DOCX ZIP members / TXT utf-8)
Processor-->>Server: return processed content / empty
else invalid or unsupported
Processor-->>Server: raise ValueError("...invalid...")
end
Processor->>FS: delete saved file (finally)
Server->>Client: 200 / 400 (ValueError) / 500 (Exception)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 406-412: The current TXT handling only decodes the 512-byte header
(header) which misses invalid UTF-8 later in the file; change the logic for the
ext == 'txt' branch to read the full file bytes from file_path (open in 'rb'),
attempt to decode the entire payload with utf-8 and if a UnicodeDecodeError
occurs raise ValueError("Invalid file content: TXT must be valid UTF-8 text");
on successful decode assign the decoded string to content (or use it when
opening as text) so the full file is validated before any further processing.
- Around line 372-418: process_file() now raises ValueError for bad input which
turns client errors into 500s because the upload route in backend/server.py
still expects a (content | None) return; update the upload route handler (the
function handling the "/upload" endpoint in backend/server.py) to catch
ValueError from process_file() and return a 400 response with the error message,
or alternatively change process_file() to preserve the old contract by returning
(None or an error indicator) instead of raising; locate process_file and the
upload route by name and ensure one of these two fixes is applied so client
input validation yields 400, not 500.
- Around line 401-404: The DOCX branch currently only checks the ZIP magic bytes
and then calls extract_text_from_docx(file_path), which allows non-DOCX ZIPs
through; before calling extract_text_from_docx (and before passing the file to
mammoth), open the file as a ZIP and verify DOCX-specific entries like
"[Content_Types].xml" and "word/document.xml" exist (use Python's zipfile to
inspect namelist()), and raise ValueError("Invalid DOCX: missing required
parts") if they are absent; update the ext == 'docx' branch that references
header and file_path to perform this validation prior to extraction.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| elif ext == 'txt': | ||
| try: | ||
| header.decode('utf-8') | ||
| except UnicodeDecodeError: | ||
| raise ValueError("Invalid file content: TXT must be valid UTF-8 text") | ||
| with open(file_path, 'r', encoding='utf-8') as f: | ||
| content = f.read() |
There was a problem hiding this comment.
Validate the full text file, not just the first 512 bytes.
Lines 407-410 decode only the header. A file with valid UTF-8 in the first 512 bytes and invalid bytes later will pass this check and then fail during f.read(), bypassing the intended ValueError path. Decode the full payload once and raise ValueError on any UnicodeDecodeError.
Suggested fix
elif ext == 'txt':
- try:
- header.decode('utf-8')
- except UnicodeDecodeError:
- raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
- with open(file_path, 'r', encoding='utf-8') as f:
- content = f.read()
+ with open(file_path, 'rb') as f:
+ raw = f.read()
+ try:
+ content = raw.decode('utf-8')
+ except UnicodeDecodeError as exc:
+ raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| elif ext == 'txt': | |
| try: | |
| header.decode('utf-8') | |
| except UnicodeDecodeError: | |
| raise ValueError("Invalid file content: TXT must be valid UTF-8 text") | |
| with open(file_path, 'r', encoding='utf-8') as f: | |
| content = f.read() | |
| elif ext == 'txt': | |
| with open(file_path, 'rb') as f: | |
| raw = f.read() | |
| try: | |
| content = raw.decode('utf-8') | |
| except UnicodeDecodeError as exc: | |
| raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/main.py` around lines 406 - 412, The current TXT handling
only decodes the 512-byte header (header) which misses invalid UTF-8 later in
the file; change the logic for the ext == 'txt' branch to read the full file
bytes from file_path (open in 'rb'), attempt to decode the entire payload with
utf-8 and if a UnicodeDecodeError occurs raise ValueError("Invalid file content:
TXT must be valid UTF-8 text"); on successful decode assign the decoded string
to content (or use it when opening as text) so the full file is validated before
any further processing.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
backend/Generator/main.py (1)
411-420: Chain re-raised exceptions withfrom(Ruff B904).When re-raising inside
except, useraise ... from exc(orfrom None) to preserve/suppress the original cause explicitly. Matches the lint hint on Lines 412 and 420.Proposed fix
- try: - with zipfile.ZipFile(file_path, 'r') as zf: - namelist = zf.namelist() - if '[Content_Types].xml' not in namelist or 'word/document.xml' not in namelist: - raise ValueError("Invalid file content: Missing DOCX internal structures") - except zipfile.BadZipFile: - raise ValueError("Invalid file content: Not a valid ZIP archive") + try: + with zipfile.ZipFile(file_path, 'r') as zf: + namelist = zf.namelist() + if '[Content_Types].xml' not in namelist or 'word/document.xml' not in namelist: + raise ValueError("Invalid file content: Missing DOCX internal structures") + except zipfile.BadZipFile as exc: + raise ValueError("Invalid file content: Not a valid ZIP archive") from exc ... - try: - header.decode('utf-8') - except UnicodeDecodeError: - raise ValueError("Invalid file content: TXT must be valid UTF-8 text") + try: + header.decode('utf-8') + except UnicodeDecodeError as exc: + raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 411 - 420, The two exception re-raises inside the file handling branch should chain the original exceptions using "from" so the cause is preserved: change the zipfile.BadZipFile handler to capture the exception (except zipfile.BadZipFile as exc) and re-raise ValueError(...) from exc, and similarly capture the UnicodeDecodeError from header.decode (except UnicodeDecodeError as exc) and re-raise ValueError(...) from exc; update the handlers around the zip extraction and the TXT UTF-8 check where extract_text_from_docx and header.decode are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 411-420: The two exception re-raises inside the file handling
branch should chain the original exceptions using "from" so the cause is
preserved: change the zipfile.BadZipFile handler to capture the exception
(except zipfile.BadZipFile as exc) and re-raise ValueError(...) from exc, and
similarly capture the UnicodeDecodeError from header.decode (except
UnicodeDecodeError as exc) and re-raise ValueError(...) from exc; update the
handlers around the zip extraction and the TXT UTF-8 check where
extract_text_from_docx and header.decode are used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8978e201-6b60-485d-9873-417ac065848d
📒 Files selected for processing (2)
backend/Generator/main.pybackend/server.py
Addressed Issues:
Enhances file upload security by validating file content instead of relying only on extensions
Screenshots/Recordings:
Not applicable (backend security improvement)
Additional Notes:
Summary
This PR introduces content-based validation for uploaded files using file signatures (magic numbers).
Previously, file validation relied only on file extensions, which could be bypassed by renaming malicious files. This change ensures that uploaded files match their expected format before processing.
Changes
%PDF)PK\x03\x04)Impact
This prevents:
AI Usage Disclosure:
I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine implementation and validate edge cases.
Checklist
Summary by CodeRabbit
Bug Fixes
Chores