Skip to content

[Backend] Add file upload validation and size limits#665

Open
SAI-HARISH2007 wants to merge 2 commits into
AOSSIE-Org:mainfrom
SAI-HARISH2007:add-upload-validation
Open

[Backend] Add file upload validation and size limits#665
SAI-HARISH2007 wants to merge 2 commits into
AOSSIE-Org:mainfrom
SAI-HARISH2007:add-upload-validation

Conversation

@SAI-HARISH2007
Copy link
Copy Markdown

@SAI-HARISH2007 SAI-HARISH2007 commented Apr 14, 2026

Addressed Issues:

Improves file upload safety and validation


Screenshots/Recordings:

Not applicable (backend validation improvement)


Additional Notes:

Summary

This PR introduces basic input validation for file uploads to improve the safety and stability of the backend.

Previously, the upload flow accepted files without checking their size or type, which could lead to server overload or invalid file processing.

Changes

  • Added a file size limit using Flask (MAX_CONTENT_LENGTH)
  • Restricted allowed file types to txt, pdf, and docx
  • Implemented early rejection for unsupported file types before processing
  • Added a custom error handler for large file uploads (413)

Impact

This helps prevent:

  • Uploading excessively large files
  • Processing unsupported or unexpected file types
  • Unnecessary load on backend services

AI Usage Disclosure:

  • This PR contains AI-assisted code. I have read the AI Usage Policy and this PR complies with it. I have tested the changes locally and take full responsibility.

I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine the implementation and improve edge-case handling.


Checklist

  • My PR addresses a single issue
  • My code follows the project's code style and conventions
  • Documentation updated (not applicable)
  • Tests added (can be added if required)
  • My changes generate no new warnings or errors
  • I have joined the Discord server and will share this PR there
  • I have read the contribution guidelines
  • I will address CodeRabbit review comments
  • I have filled this PR template completely

Summary by CodeRabbit

Release Notes

  • New Features

    • File uploads now limited to 5 MB maximum with strict file type validation
    • Supported file formats restricted to TXT, PDF, and DOCX for enhanced compatibility and security
  • Bug Fixes

    • Improved temporary file cleanup during transcript extraction to guarantee resources are properly released even when processing encounters errors

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

New GSoC 2026 proposal document detailing backend hardening efforts, and corresponding security improvements to the Flask backend including file upload validation with extension whitelisting, request size limiting, and guaranteed temporary file cleanup via try/finally blocks.

Changes

Cohort / File(s) Summary
Documentation
GSOC_PROPOSAL_DRAFT.md
New GSoC 2026 proposal describing planned security, reliability, and scalability hardening work including file upload security, command injection mitigation, async task migration, rate limiting, and ML model isolation.
Backend Security Hardening
backend/server.py
Added file extension whitelisting (txt, pdf, docx) with allowed_file() helper; set MAX_CONTENT_LENGTH to 5MB with RequestEntityTooLarge error handler; refactored /getTranscript to use try/finally for guaranteed cleanup of temporary .vtt subtitle files, with fallback sweep logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With whiskers twitching, I inspect each file,
Extensions validated, dangers styled,
Five megs the limit, requests we deny,
Temp files swept—no mess left behind! 🔐✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: adding file upload validation (extension whitelist) and size limits (MAX_CONTENT_LENGTH + 413 handler).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/server.py`:
- Around line 572-578: The code currently finds the newest file via
subtitle_files = glob.glob("subtitles/*.vtt") and latest_subtitle = max(...),
which is race-prone; instead make the subtitle output path request-scoped (e.g.,
create a unique temp dir or unique filename per request) and read that exact
path rather than scanning the shared "subtitles" folder. Update the code that
generates the .vtt to accept/return a deterministic path (or use
tempfile.mkdtemp()/NamedTemporaryFile) and then call clean_transcript() with
that specific path (replace the glob/max logic around
subtitle_files/latest_subtitle), and ensure cleanup happens only for that
request's file to avoid cross-request deletion.
- Around line 516-519: The filename is not sanitized before being passed to
FileProcessor, allowing path traversal via file.filename; sanitize the name
(e.g., use werkzeug.utils.secure_filename or at minimum strip path separators
and enforce os.path.basename and reject empty results) immediately after the
allowed_file check and before calling FileProcessor, and also update
FileProcessor where it constructs the path with upload_folder to always use the
sanitized filename (and reject filenames that are absolute or contain ..);
reference allowed_file, file.filename, FileProcessor, and upload_folder and
ensure both the upload entrypoint and the save logic use the same sanitized
value.
- Around line 567-570: Validate and sanitize the video_id before use: ensure the
value from request.args.get('videoId') matches a strict allowlist (e.g., regex
like /^[A-Za-z0-9_-]{11}$/ or your permitted pattern) and return a 400 error if
it fails; reject any value containing path separators, dots sequences, or glob
metacharacters. Use the validated/sanitized identifier when constructing the
output path for the subprocess.run call (reference variable video_id and the
subprocess.run invocation) by joining with a safe directory (e.g., using
pathlib.Path("subtitles") / safe_name) and avoid constructing glob patterns with
the raw input—either build the exact filename from the sanitized id for cleanup
or sanitize before passing to glob. Ensure all places that reference the
original video_id (the subprocess output -o argument and the cleanup glob usage)
use the validated safe_name instead of the raw input.

In `@GSOC_PROPOSAL_DRAFT.md`:
- Around line 57-70: The proposal text is out of sync with the shipped code:
update the paragraph so it reflects that the backend already uses argument-list
mode (subprocess.run([...]) with the video_id variable passed as an element)
instead of shell=True, and that the implemented upload allowlist contains txt,
pdf, and docx (not just PDF and DOCX); mention the validation and size limits as
implemented (application-layer checks plus upstream Nginx limits) and remove the
example implying shell=True or a different allowlist.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ebef0ddd-1bdb-4a3c-b950-a31b5674953c

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 66a5dc7.

📒 Files selected for processing (2)
  • GSOC_PROPOSAL_DRAFT.md
  • backend/server.py

Comment thread backend/server.py
Comment on lines +516 to +519
# 3. Reject invalid files EARLY
if not allowed_file(file.filename):
return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that the raw client filename still reaches the filesystem save path.
sed -n '352,384p' backend/Generator/main.py
echo
rg -n -C2 'process_file|file\.filename|secure_filename|allowed_file' backend/server.py backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 3015


Sanitize the filename before handing it to FileProcessor.

The allowlist only checks the extension. A name like ../../tmp/payload.txt still passes the validation and reaches backend/Generator/main.py:371 where os.path.join(self.upload_folder, file.filename) saves the file using the raw, unsanitized filename. This leaves the upload path traversal vulnerability open.

Suggested fix
+from werkzeug.utils import secure_filename
+
 `@app.route`('/upload', methods=['POST'])
 def upload_file():
     if 'file' not in request.files:
         return jsonify({"error": "No file part"}), 400

     file = request.files['file']

     if file.filename == '':
         return jsonify({"error": "No selected file"}), 400

-    if not allowed_file(file.filename):
+    safe_name = secure_filename(file.filename)
+    if not safe_name or not allowed_file(safe_name):
         return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400
+    file.filename = safe_name

     content = file_processor.process_file(file)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# 3. Reject invalid files EARLY
if not allowed_file(file.filename):
return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400
from werkzeug.utils import secure_filename
# (other code in the function...)
# 3. Reject invalid files EARLY
safe_name = secure_filename(file.filename)
if not safe_name or not allowed_file(safe_name):
return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400
file.filename = safe_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 516 - 519, The filename is not sanitized
before being passed to FileProcessor, allowing path traversal via file.filename;
sanitize the name (e.g., use werkzeug.utils.secure_filename or at minimum strip
path separators and enforce os.path.basename and reject empty results)
immediately after the allowed_file check and before calling FileProcessor, and
also update FileProcessor where it constructs the path with upload_folder to
always use the sanitized filename (and reject filenames that are absolute or
contain ..); reference allowed_file, file.filename, FileProcessor, and
upload_folder and ensure both the upload entrypoint and the save logic use the
same sanitized value.

Comment thread backend/server.py
Comment on lines +567 to +570
try:
subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
"--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
check=True, capture_output=True, text=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that raw video_id is interpolated into filesystem paths/globs.
sed -n '558,592p' backend/server.py
echo
rg -n -C2 'video_id|subtitles/\{video_id\}|glob\.glob' backend/server.py

Repository: AOSSIE-Org/EduAid

Length of output: 2738


Validate video_id before using it in paths or glob patterns.

video_id from request.args.get('videoId') is used directly in the yt-dlp output path (line 569) and in the cleanup glob pattern (line 589) without validation. Characters like ../ or glob metacharacters (*, ?, [) can escape the intended directory or match unrelated files.

Suggested fix
 `@app.route`('/getTranscript', methods=['GET'])
 def get_transcript():
     video_id = request.args.get('videoId')
     if not video_id:
         return jsonify({"error": "No video ID provided"}), 400
+    if not re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id):
+        return jsonify({"error": "Invalid video ID"}), 400

     transcript_text = None
     latest_subtitle = None
🧰 Tools
🪛 Ruff (0.15.9)

[error] 568-568: subprocess call: check for execution of untrusted input

(S603)


[error] 568-569: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 567 - 570, Validate and sanitize the video_id
before use: ensure the value from request.args.get('videoId') matches a strict
allowlist (e.g., regex like /^[A-Za-z0-9_-]{11}$/ or your permitted pattern) and
return a 400 error if it fails; reject any value containing path separators,
dots sequences, or glob metacharacters. Use the validated/sanitized identifier
when constructing the output path for the subprocess.run call (reference
variable video_id and the subprocess.run invocation) by joining with a safe
directory (e.g., using pathlib.Path("subtitles") / safe_name) and avoid
constructing glob patterns with the raw input—either build the exact filename
from the sanitized id for cleanup or sanitize before passing to glob. Ensure all
places that reference the original video_id (the subprocess output -o argument
and the cleanup glob usage) use the validated safe_name instead of the raw
input.

Comment thread backend/server.py
Comment on lines +572 to +578
# Find the latest .vtt file in the "subtitles" folder
subtitle_files = glob.glob("subtitles/*.vtt")
if not subtitle_files:
return jsonify({"error": "No subtitles found"}), 404

latest_subtitle = max(subtitle_files, key=os.path.getctime)
transcript_text = clean_transcript(latest_subtitle)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Stop picking the “latest” subtitle from a shared directory.

This is still race-prone. Two concurrent requests can both write into subtitles/, then whichever file has the newest ctime gets read and deleted by both handlers. That can return the wrong transcript or break the other request. Use a request-scoped temp directory (or an exact output path you already know) instead of scanning subtitles/*.vtt.

Suggested direction
+import shutil
+import uuid
+
 `@app.route`('/getTranscript', methods=['GET'])
 def get_transcript():
     video_id = request.args.get('videoId')
     if not video_id:
         return jsonify({"error": "No video ID provided"}), 400

     transcript_text = None
     latest_subtitle = None
+    request_dir = os.path.join("subtitles", uuid.uuid4().hex)
+    os.makedirs(request_dir, exist_ok=True)

     try:
+        output_template = os.path.join(request_dir, "%(id)s.%(ext)s")
         subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
-                    "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
+                    "--sub-format", "vtt", "-o", output_template, f"https://www.youtube.com/watch?v={video_id}"],
                    check=True, capture_output=True, text=True)

-        subtitle_files = glob.glob("subtitles/*.vtt")
+        subtitle_files = glob.glob(os.path.join(request_dir, "*.vtt"))
         if not subtitle_files:
             return jsonify({"error": "No subtitles found"}), 404

         latest_subtitle = max(subtitle_files, key=os.path.getctime)
         transcript_text = clean_transcript(latest_subtitle)

         return jsonify({"transcript": transcript_text})

     finally:
-        if latest_subtitle and os.path.exists(latest_subtitle):
-            app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
-            os.remove(latest_subtitle)
-        elif not latest_subtitle:
-            for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
-                if os.path.exists(stray):
-                    app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
-                    os.remove(stray)
+        shutil.rmtree(request_dir, ignore_errors=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 572 - 578, The code currently finds the
newest file via subtitle_files = glob.glob("subtitles/*.vtt") and
latest_subtitle = max(...), which is race-prone; instead make the subtitle
output path request-scoped (e.g., create a unique temp dir or unique filename
per request) and read that exact path rather than scanning the shared
"subtitles" folder. Update the code that generates the .vtt to accept/return a
deterministic path (or use tempfile.mkdtemp()/NamedTemporaryFile) and then call
clean_transcript() with that specific path (replace the glob/max logic around
subtitle_files/latest_subtitle), and ensure cleanup happens only for that
request's file to avoid cross-request deletion.

Comment thread GSOC_PROPOSAL_DRAFT.md
Comment on lines +57 to +70
The `yt-dlp` injection fix runs the same week. Right now `video_id` comes from user input and goes into a subprocess call with `shell=True`. I sanitize the parameter and switch to argument list mode so shell metacharacters are never interpreted:

```python
import re
import subprocess

if not re.fullmatch(r'[A-Za-z0-9_-]{6,16}', video_id):
raise ValueError("Invalid video ID")
subprocess.run(["yt-dlp", video_id, "-o", "output.mp4"])
```

Test first here too. I pass `; rm -rf /tmp/test` as a video ID and confirm it gets blocked before touching the subprocess call.

File upload validation also goes in this week: a whitelist of PDF and DOCX only, size limits at both the application layer and Nginx, and payload checks that reject oversized requests before Flask ever sees them.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Align this section with the code shipped in this PR.

This paragraph describes the current backend as using shell=True and says the upload whitelist will be “PDF and DOCX only,” but backend/server.py already uses list-mode subprocess.run(...) and the implemented allowlist is txt, pdf, docx. Updating the proposal text will keep reviewers from reading two different baselines in the same PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GSOC_PROPOSAL_DRAFT.md` around lines 57 - 70, The proposal text is out of
sync with the shipped code: update the paragraph so it reflects that the backend
already uses argument-list mode (subprocess.run([...]) with the video_id
variable passed as an element) instead of shell=True, and that the implemented
upload allowlist contains txt, pdf, and docx (not just PDF and DOCX); mention
the validation and size limits as implemented (application-layer checks plus
upstream Nginx limits) and remove the example implying shell=True or a different
allowlist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant