Skip to content

[Backend] Add content-based file validation (magic number checks)#667

Open
SAI-HARISH2007 wants to merge 2 commits into
AOSSIE-Org:mainfrom
SAI-HARISH2007:add-magic-validation
Open

[Backend] Add content-based file validation (magic number checks)#667
SAI-HARISH2007 wants to merge 2 commits into
AOSSIE-Org:mainfrom
SAI-HARISH2007:add-magic-validation

Conversation

@SAI-HARISH2007
Copy link
Copy Markdown

@SAI-HARISH2007 SAI-HARISH2007 commented Apr 15, 2026

Addressed Issues:

Enhances file upload security by validating file content instead of relying only on extensions


Screenshots/Recordings:

Not applicable (backend security improvement)


Additional Notes:

Summary

This PR introduces content-based validation for uploaded files using file signatures (magic numbers).

Previously, file validation relied only on file extensions, which could be bypassed by renaming malicious files. This change ensures that uploaded files match their expected format before processing.

Changes

  • Added magic number validation for:
    • PDF (%PDF)
    • DOCX (PK\x03\x04)
    • TXT (valid UTF-8 text)
  • Integrated validation into the file processing pipeline before extraction
  • Maintained safe file handling with proper cleanup using try/finally
  • Continued use of secure filename handling and path validation

Impact

This prevents:

  • Malicious files disguised with fake extensions
  • Unexpected crashes from invalid file formats
  • Unsafe data being passed into parsing libraries

AI Usage Disclosure:

  • This PR contains AI-assisted code. I have read the AI Usage Policy and this PR complies with it. I have tested the changes locally and take full responsibility.

I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine implementation and validate edge cases.


Checklist

  • My PR addresses a single issue
  • My code follows project conventions
  • Documentation updated (not applicable)
  • Tests added (can be added if required)
  • No new warnings/errors introduced
  • Will share PR on Discord
  • Read contribution guidelines
  • Will address CodeRabbit comments
  • Template filled correctly

Summary by CodeRabbit

  • Bug Fixes

    • Improved upload security: filename sanitization, directory traversal protection, and guaranteed cleanup of temporary uploads.
    • Stronger file validation: checks file headers and internal structure (PDF, DOCX/ZIP, TXT) and rejects malformed or non-UTF-8 text files.
    • Better error handling: clearer client errors for invalid inputs and safe server errors for unexpected failures.
  • Chores

    • Added logging for validation and processing errors.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

Added filename sanitization, absolute-path resolution to prevent directory traversal, binary-format signature checks for PDF/DOCX, UTF-8 validation for TXT, guaranteed file cleanup, and expanded upload route error handling with specific ValueError -> 400 and generic -> 500 responses.

Changes

Cohort / File(s) Summary
File processing & validation
backend/Generator/main.py
Sanitize filenames using secure_filename, validate non-empty names, construct absolute target paths and reject path traversal, replace extension-only dispatch with header/signature checks (PDF: %PDF, DOCX/ZIP: PK\x03\x04 plus required internal members), validate TXT as UTF-8, and always delete saved file in a finally block; raise ValueError for unsupported/invalid content.
Upload route error handling
backend/server.py
Wraps file_processor.process_file(file) with try/except: catches ValueError to log warning and return 400 with message, catches generic Exception to log error and return 500 with generic message; preserves previous success/empty-content behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Server as Server (/upload)
    participant Processor as FileProcessor
    participant FS as Filesystem

    Client->>Server: POST /upload (multipart file)
    Server->>Processor: process_file(file)
    Processor->>Processor: secure_filename -> resolve absolute target path
    Processor->>FS: save temp file
    Processor->>Processor: read header bytes, detect signature
    alt valid PDF/DOCX/TXT
        Processor->>Processor: full validation (DOCX ZIP members / TXT utf-8)
        Processor-->>Server: return processed content / empty
    else invalid or unsupported
        Processor-->>Server: raise ValueError("...invalid...")
    end
    Processor->>FS: delete saved file (finally)
    Server->>Client: 200 / 400 (ValueError) / 500 (Exception)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through paths both safe and neat,

sanitized names and traversal beat.
I sniffed PDF and DOCX by sign,
checked TXT for UTF-8 fine.
Cleanup done — a tidy feat! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main security enhancement: adding content-based file validation using magic numbers. It directly reflects the primary change across both modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 406-412: The current TXT handling only decodes the 512-byte header
(header) which misses invalid UTF-8 later in the file; change the logic for the
ext == 'txt' branch to read the full file bytes from file_path (open in 'rb'),
attempt to decode the entire payload with utf-8 and if a UnicodeDecodeError
occurs raise ValueError("Invalid file content: TXT must be valid UTF-8 text");
on successful decode assign the decoded string to content (or use it when
opening as text) so the full file is validated before any further processing.
- Around line 372-418: process_file() now raises ValueError for bad input which
turns client errors into 500s because the upload route in backend/server.py
still expects a (content | None) return; update the upload route handler (the
function handling the "/upload" endpoint in backend/server.py) to catch
ValueError from process_file() and return a 400 response with the error message,
or alternatively change process_file() to preserve the old contract by returning
(None or an error indicator) instead of raising; locate process_file and the
upload route by name and ensure one of these two fixes is applied so client
input validation yields 400, not 500.
- Around line 401-404: The DOCX branch currently only checks the ZIP magic bytes
and then calls extract_text_from_docx(file_path), which allows non-DOCX ZIPs
through; before calling extract_text_from_docx (and before passing the file to
mammoth), open the file as a ZIP and verify DOCX-specific entries like
"[Content_Types].xml" and "word/document.xml" exist (use Python's zipfile to
inspect namelist()), and raise ValueError("Invalid DOCX: missing required
parts") if they are absent; update the ext == 'docx' branch that references
header and file_path to perform this validation prior to extraction.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 235bf715-e80f-49b2-a1b1-217cba8be2ff

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 4ed5441.

📒 Files selected for processing (1)
  • backend/Generator/main.py

Comment thread backend/Generator/main.py
Comment thread backend/Generator/main.py
Comment thread backend/Generator/main.py
Comment on lines +406 to +412
elif ext == 'txt':
try:
header.decode('utf-8')
except UnicodeDecodeError:
raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate the full text file, not just the first 512 bytes.

Lines 407-410 decode only the header. A file with valid UTF-8 in the first 512 bytes and invalid bytes later will pass this check and then fail during f.read(), bypassing the intended ValueError path. Decode the full payload once and raise ValueError on any UnicodeDecodeError.

Suggested fix
             elif ext == 'txt':
-                try:
-                    header.decode('utf-8')
-                except UnicodeDecodeError:
-                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
-                with open(file_path, 'r', encoding='utf-8') as f:
-                    content = f.read()
+                with open(file_path, 'rb') as f:
+                    raw = f.read()
+                try:
+                    content = raw.decode('utf-8')
+                except UnicodeDecodeError as exc:
+                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif ext == 'txt':
try:
header.decode('utf-8')
except UnicodeDecodeError:
raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
elif ext == 'txt':
with open(file_path, 'rb') as f:
raw = f.read()
try:
content = raw.decode('utf-8')
except UnicodeDecodeError as exc:
raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 406 - 412, The current TXT handling
only decodes the 512-byte header (header) which misses invalid UTF-8 later in
the file; change the logic for the ext == 'txt' branch to read the full file
bytes from file_path (open in 'rb'), attempt to decode the entire payload with
utf-8 and if a UnicodeDecodeError occurs raise ValueError("Invalid file content:
TXT must be valid UTF-8 text"); on successful decode assign the decoded string
to content (or use it when opening as text) so the full file is validated before
any further processing.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/Generator/main.py (1)

411-420: Chain re-raised exceptions with from (Ruff B904).

When re-raising inside except, use raise ... from exc (or from None) to preserve/suppress the original cause explicitly. Matches the lint hint on Lines 412 and 420.

Proposed fix
-                try:
-                    with zipfile.ZipFile(file_path, 'r') as zf:
-                        namelist = zf.namelist()
-                        if '[Content_Types].xml' not in namelist or 'word/document.xml' not in namelist:
-                            raise ValueError("Invalid file content: Missing DOCX internal structures")
-                except zipfile.BadZipFile:
-                    raise ValueError("Invalid file content: Not a valid ZIP archive")
+                try:
+                    with zipfile.ZipFile(file_path, 'r') as zf:
+                        namelist = zf.namelist()
+                        if '[Content_Types].xml' not in namelist or 'word/document.xml' not in namelist:
+                            raise ValueError("Invalid file content: Missing DOCX internal structures")
+                except zipfile.BadZipFile as exc:
+                    raise ValueError("Invalid file content: Not a valid ZIP archive") from exc
...
-                try:
-                    header.decode('utf-8')
-                except UnicodeDecodeError:
-                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
+                try:
+                    header.decode('utf-8')
+                except UnicodeDecodeError as exc:
+                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 411 - 420, The two exception
re-raises inside the file handling branch should chain the original exceptions
using "from" so the cause is preserved: change the zipfile.BadZipFile handler to
capture the exception (except zipfile.BadZipFile as exc) and re-raise
ValueError(...) from exc, and similarly capture the UnicodeDecodeError from
header.decode (except UnicodeDecodeError as exc) and re-raise ValueError(...)
from exc; update the handlers around the zip extraction and the TXT UTF-8 check
where extract_text_from_docx and header.decode are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 411-420: The two exception re-raises inside the file handling
branch should chain the original exceptions using "from" so the cause is
preserved: change the zipfile.BadZipFile handler to capture the exception
(except zipfile.BadZipFile as exc) and re-raise ValueError(...) from exc, and
similarly capture the UnicodeDecodeError from header.decode (except
UnicodeDecodeError as exc) and re-raise ValueError(...) from exc; update the
handlers around the zip extraction and the TXT UTF-8 check where
extract_text_from_docx and header.decode are used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8978e201-6b60-485d-9873-417ac065848d

📥 Commits

Reviewing files that changed from the base of the PR and between 4ed5441 and ab46ce3.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/server.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant