[Backend] Add content-based file validation (magic number checks) by SAI-HARISH2007 · Pull Request #667 · AOSSIE-Org/EduAid

SAI-HARISH2007 · 2026-04-15T09:27:43Z

Addressed Issues:

Enhances file upload security by validating file content instead of relying only on extensions

Screenshots/Recordings:

Not applicable (backend security improvement)

Additional Notes:

Summary

This PR introduces content-based validation for uploaded files using file signatures (magic numbers).

Previously, file validation relied only on file extensions, which could be bypassed by renaming malicious files. This change ensures that uploaded files match their expected format before processing.

Changes

Added magic number validation for:
- PDF (%PDF)
- DOCX (PK\x03\x04)
- TXT (valid UTF-8 text)
Integrated validation into the file processing pipeline before extraction
Maintained safe file handling with proper cleanup using try/finally
Continued use of secure filename handling and path validation

Impact

This prevents:

Malicious files disguised with fake extensions
Unexpected crashes from invalid file formats
Unsafe data being passed into parsing libraries

AI Usage Disclosure:

This PR contains AI-assisted code. I have read the AI Usage Policy and this PR complies with it. I have tested the changes locally and take full responsibility.

I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine implementation and validate edge cases.

Checklist

My PR addresses a single issue
My code follows project conventions
Documentation updated (not applicable)
Tests added (can be added if required)
No new warnings/errors introduced
Will share PR on Discord
Read contribution guidelines
Will address CodeRabbit comments
Template filled correctly

Summary by CodeRabbit

Bug Fixes
- Improved upload security: filename sanitization, directory traversal protection, and guaranteed cleanup of temporary uploads.
- Stronger file validation: checks file headers and internal structure (PDF, DOCX/ZIP, TXT) and rejects malformed or non-UTF-8 text files.
- Better error handling: clearer client errors for invalid inputs and safe server errors for unexpected failures.
Chores
- Added logging for validation and processing errors.

coderabbitai · 2026-04-15T09:32:51Z

📝 Walkthrough

Walkthrough

Added filename sanitization, absolute-path resolution to prevent directory traversal, binary-format signature checks for PDF/DOCX, UTF-8 validation for TXT, guaranteed file cleanup, and expanded upload route error handling with specific ValueError -> 400 and generic -> 500 responses.

Changes

Cohort / File(s)	Summary
File processing & validation `backend/Generator/main.py`	Sanitize filenames using `secure_filename`, validate non-empty names, construct absolute target paths and reject path traversal, replace extension-only dispatch with header/signature checks (PDF: `%PDF`, DOCX/ZIP: `PK\x03\x04` plus required internal members), validate TXT as UTF-8, and always delete saved file in a `finally` block; raise ValueError for unsupported/invalid content.
Upload route error handling `backend/server.py`	Wraps `file_processor.process_file(file)` with try/except: catches `ValueError` to log warning and return 400 with message, catches generic `Exception` to log error and return 500 with generic message; preserves previous success/empty-content behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Server as Server (/upload)
    participant Processor as FileProcessor
    participant FS as Filesystem

    Client->>Server: POST /upload (multipart file)
    Server->>Processor: process_file(file)
    Processor->>Processor: secure_filename -> resolve absolute target path
    Processor->>FS: save temp file
    Processor->>Processor: read header bytes, detect signature
    alt valid PDF/DOCX/TXT
        Processor->>Processor: full validation (DOCX ZIP members / TXT utf-8)
        Processor-->>Server: return processed content / empty
    else invalid or unsupported
        Processor-->>Server: raise ValueError("...invalid...")
    end
    Processor->>FS: delete saved file (finally)
    Server->>Client: 200 / 400 (ValueError) / 500 (Exception)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through paths both safe and neat,

sanitized names and traversal beat.
I sniffed PDF and DOCX by sign,
checked TXT for UTF-8 fine.
Cleanup done — a tidy feat! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main security enhancement: adding content-based file validation using magic numbers. It directly reflects the primary change across both modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 406-412: The current TXT handling only decodes the 512-byte header
(header) which misses invalid UTF-8 later in the file; change the logic for the
ext == 'txt' branch to read the full file bytes from file_path (open in 'rb'),
attempt to decode the entire payload with utf-8 and if a UnicodeDecodeError
occurs raise ValueError("Invalid file content: TXT must be valid UTF-8 text");
on successful decode assign the decoded string to content (or use it when
opening as text) so the full file is validated before any further processing.
- Around line 372-418: process_file() now raises ValueError for bad input which
turns client errors into 500s because the upload route in backend/server.py
still expects a (content | None) return; update the upload route handler (the
function handling the "/upload" endpoint in backend/server.py) to catch
ValueError from process_file() and return a 400 response with the error message,
or alternatively change process_file() to preserve the old contract by returning
(None or an error indicator) instead of raising; locate process_file and the
upload route by name and ensure one of these two fixes is applied so client
input validation yields 400, not 500.
- Around line 401-404: The DOCX branch currently only checks the ZIP magic bytes
and then calls extract_text_from_docx(file_path), which allows non-DOCX ZIPs
through; before calling extract_text_from_docx (and before passing the file to
mammoth), open the file as a ZIP and verify DOCX-specific entries like
"[Content_Types].xml" and "word/document.xml" exist (use Python's zipfile to
inspect namelist()), and raise ValueError("Invalid DOCX: missing required
parts") if they are absent; update the ext == 'docx' branch that references
header and file_path to perform this validation prior to extraction.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 235bf715-e80f-49b2-a1b1-217cba8be2ff

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 4ed5441.

📒 Files selected for processing (1)

backend/Generator/main.py

coderabbitai · 2026-04-15T09:36:20Z

+            elif ext == 'txt':
+                try:
+                    header.decode('utf-8')
+                except UnicodeDecodeError:
+                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
+                with open(file_path, 'r', encoding='utf-8') as f:
+                    content = f.read()


⚠️ Potential issue | 🟠 Major

Validate the full text file, not just the first 512 bytes.

Lines 407-410 decode only the header. A file with valid UTF-8 in the first 512 bytes and invalid bytes later will pass this check and then fail during f.read(), bypassing the intended ValueError path. Decode the full payload once and raise ValueError on any UnicodeDecodeError.

Suggested fix

elif ext == 'txt': - try: - header.decode('utf-8') - except UnicodeDecodeError: - raise ValueError("Invalid file content: TXT must be valid UTF-8 text") - with open(file_path, 'r', encoding='utf-8') as f: - content = f.read() + with open(file_path, 'rb') as f: + raw = f.read() + try: + content = raw.decode('utf-8') + except UnicodeDecodeError as exc: + raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

elif ext == 'txt':

try:

header.decode('utf-8')

except UnicodeDecodeError:

raise ValueError("Invalid file content: TXT must be valid UTF-8 text")

with open(file_path, 'r', encoding='utf-8') as f:

content = f.read()

elif ext == 'txt':

with open(file_path, 'rb') as f:

raw = f.read()

try:

content = raw.decode('utf-8')

except UnicodeDecodeError as exc:

raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 406 - 412, The current TXT handling only decodes the 512-byte header (header) which misses invalid UTF-8 later in the file; change the logic for the ext == 'txt' branch to read the full file bytes from file_path (open in 'rb'), attempt to decode the entire payload with utf-8 and if a UnicodeDecodeError occurs raise ValueError("Invalid file content: TXT must be valid UTF-8 text"); on successful decode assign the decoded string to content (or use it when opening as text) so the full file is validated before any further processing.

coderabbitai

🧹 Nitpick comments (1)

backend/Generator/main.py (1)

411-420: Chain re-raised exceptions with from (Ruff B904).

When re-raising inside except, use raise ... from exc (or from None) to preserve/suppress the original cause explicitly. Matches the lint hint on Lines 412 and 420.

Proposed fix

-                try:
-                    with zipfile.ZipFile(file_path, 'r') as zf:
-                        namelist = zf.namelist()
-                        if '[Content_Types].xml' not in namelist or 'word/document.xml' not in namelist:
-                            raise ValueError("Invalid file content: Missing DOCX internal structures")
-                except zipfile.BadZipFile:
-                    raise ValueError("Invalid file content: Not a valid ZIP archive")
+                try:
+                    with zipfile.ZipFile(file_path, 'r') as zf:
+                        namelist = zf.namelist()
+                        if '[Content_Types].xml' not in namelist or 'word/document.xml' not in namelist:
+                            raise ValueError("Invalid file content: Missing DOCX internal structures")
+                except zipfile.BadZipFile as exc:
+                    raise ValueError("Invalid file content: Not a valid ZIP archive") from exc
...
-                try:
-                    header.decode('utf-8')
-                except UnicodeDecodeError:
-                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text")
+                try:
+                    header.decode('utf-8')
+                except UnicodeDecodeError as exc:
+                    raise ValueError("Invalid file content: TXT must be valid UTF-8 text") from exc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 411 - 420, The two exception
re-raises inside the file handling branch should chain the original exceptions
using "from" so the cause is preserved: change the zipfile.BadZipFile handler to
capture the exception (except zipfile.BadZipFile as exc) and re-raise
ValueError(...) from exc, and similarly capture the UnicodeDecodeError from
header.decode (except UnicodeDecodeError as exc) and re-raise ValueError(...)
from exc; update the handlers around the zip extraction and the TXT UTF-8 check
where extract_text_from_docx and header.decode are used.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 411-420: The two exception re-raises inside the file handling
branch should chain the original exceptions using "from" so the cause is
preserved: change the zipfile.BadZipFile handler to capture the exception
(except zipfile.BadZipFile as exc) and re-raise ValueError(...) from exc, and
similarly capture the UnicodeDecodeError from header.decode (except
UnicodeDecodeError as exc) and re-raise ValueError(...) from exc; update the
handlers around the zip extraction and the TXT UTF-8 check where
extract_text_from_docx and header.decode are used.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8978e201-6b60-485d-9873-417ac065848d

📥 Commits

Reviewing files that changed from the base of the PR and between 4ed5441 and ab46ce3.

📒 Files selected for processing (2)

backend/Generator/main.py
backend/server.py

feat: add content-based file validation (magic number checks)

4ed5441

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

fix: preserve upload API contract and improve DOCX validation

ab46ce3

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Backend] Add content-based file validation (magic number checks)#667

[Backend] Add content-based file validation (magic number checks)#667
SAI-HARISH2007 wants to merge 2 commits into
AOSSIE-Org:mainfrom
SAI-HARISH2007:add-magic-validation

SAI-HARISH2007 commented Apr 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 15, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 15, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

SAI-HARISH2007 commented Apr 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Summary

Changes

Impact

AI Usage Disclosure:

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SAI-HARISH2007 commented Apr 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 15, 2026 •

edited

Loading