[Backend] Add file upload validation and size limits by SAI-HARISH2007 · Pull Request #665 · AOSSIE-Org/EduAid

SAI-HARISH2007 · 2026-04-14T06:56:28Z

Addressed Issues:

Improves file upload safety and validation

Screenshots/Recordings:

Not applicable (backend validation improvement)

Additional Notes:

Summary

This PR introduces basic input validation for file uploads to improve the safety and stability of the backend.

Previously, the upload flow accepted files without checking their size or type, which could lead to server overload or invalid file processing.

Changes

Added a file size limit using Flask (MAX_CONTENT_LENGTH)
Restricted allowed file types to txt, pdf, and docx
Implemented early rejection for unsupported file types before processing
Added a custom error handler for large file uploads (413)

Impact

This helps prevent:

Uploading excessively large files
Processing unsupported or unexpected file types
Unnecessary load on backend services

AI Usage Disclosure:

This PR contains AI-assisted code. I have read the AI Usage Policy and this PR complies with it. I have tested the changes locally and take full responsibility.

I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine the implementation and improve edge-case handling.

Checklist

My PR addresses a single issue
My code follows the project's code style and conventions
Documentation updated (not applicable)
Tests added (can be added if required)
My changes generate no new warnings or errors
I have joined the Discord server and will share this PR there
I have read the contribution guidelines
I will address CodeRabbit review comments
I have filled this PR template completely

Summary by CodeRabbit

Release Notes

New Features
- File uploads now limited to 5 MB maximum with strict file type validation
- Supported file formats restricted to TXT, PDF, and DOCX for enhanced compatibility and security
Bug Fixes
- Improved temporary file cleanup during transcript extraction to guarantee resources are properly released even when processing encounters errors

coderabbitai · 2026-04-14T06:56:42Z

📝 Walkthrough

Walkthrough

New GSoC 2026 proposal document detailing backend hardening efforts, and corresponding security improvements to the Flask backend including file upload validation with extension whitelisting, request size limiting, and guaranteed temporary file cleanup via try/finally blocks.

Changes

Cohort / File(s)	Summary
Documentation `GSOC_PROPOSAL_DRAFT.md`	New GSoC 2026 proposal describing planned security, reliability, and scalability hardening work including file upload security, command injection mitigation, async task migration, rate limiting, and ML model isolation.
Backend Security Hardening `backend/server.py`	Added file extension whitelisting (`txt`, `pdf`, `docx`) with `allowed_file()` helper; set `MAX_CONTENT_LENGTH` to 5MB with `RequestEntityTooLarge` error handler; refactored `/getTranscript` to use `try/finally` for guaranteed cleanup of temporary `.vtt` subtitle files, with fallback sweep logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With whiskers twitching, I inspect each file,
Extensions validated, dangers styled,
Five megs the limit, requests we deny,
Temp files swept—no mess left behind! 🔐✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main changes: adding file upload validation (extension whitelist) and size limits (MAX_CONTENT_LENGTH + 413 handler).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/server.py`:
- Around line 572-578: The code currently finds the newest file via
subtitle_files = glob.glob("subtitles/*.vtt") and latest_subtitle = max(...),
which is race-prone; instead make the subtitle output path request-scoped (e.g.,
create a unique temp dir or unique filename per request) and read that exact
path rather than scanning the shared "subtitles" folder. Update the code that
generates the .vtt to accept/return a deterministic path (or use
tempfile.mkdtemp()/NamedTemporaryFile) and then call clean_transcript() with
that specific path (replace the glob/max logic around
subtitle_files/latest_subtitle), and ensure cleanup happens only for that
request's file to avoid cross-request deletion.
- Around line 516-519: The filename is not sanitized before being passed to
FileProcessor, allowing path traversal via file.filename; sanitize the name
(e.g., use werkzeug.utils.secure_filename or at minimum strip path separators
and enforce os.path.basename and reject empty results) immediately after the
allowed_file check and before calling FileProcessor, and also update
FileProcessor where it constructs the path with upload_folder to always use the
sanitized filename (and reject filenames that are absolute or contain ..);
reference allowed_file, file.filename, FileProcessor, and upload_folder and
ensure both the upload entrypoint and the save logic use the same sanitized
value.
- Around line 567-570: Validate and sanitize the video_id before use: ensure the
value from request.args.get('videoId') matches a strict allowlist (e.g., regex
like /^[A-Za-z0-9_-]{11}$/ or your permitted pattern) and return a 400 error if
it fails; reject any value containing path separators, dots sequences, or glob
metacharacters. Use the validated/sanitized identifier when constructing the
output path for the subprocess.run call (reference variable video_id and the
subprocess.run invocation) by joining with a safe directory (e.g., using
pathlib.Path("subtitles") / safe_name) and avoid constructing glob patterns with
the raw input—either build the exact filename from the sanitized id for cleanup
or sanitize before passing to glob. Ensure all places that reference the
original video_id (the subprocess output -o argument and the cleanup glob usage)
use the validated safe_name instead of the raw input.

In `@GSOC_PROPOSAL_DRAFT.md`:
- Around line 57-70: The proposal text is out of sync with the shipped code:
update the paragraph so it reflects that the backend already uses argument-list
mode (subprocess.run([...]) with the video_id variable passed as an element)
instead of shell=True, and that the implemented upload allowlist contains txt,
pdf, and docx (not just PDF and DOCX); mention the validation and size limits as
implemented (application-layer checks plus upstream Nginx limits) and remove the
example implying shell=True or a different allowlist.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ebef0ddd-1bdb-4a3c-b950-a31b5674953c

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 66a5dc7.

📒 Files selected for processing (2)

GSOC_PROPOSAL_DRAFT.md
backend/server.py

coderabbitai · 2026-04-14T07:01:23Z

+    # 3. Reject invalid files EARLY
+    if not allowed_file(file.filename):
+        return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400
+


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify that the raw client filename still reaches the filesystem save path. sed -n '352,384p' backend/Generator/main.py echo rg -n -C2 'process_file|file\.filename|secure_filename|allowed_file' backend/server.py backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 3015

Sanitize the filename before handing it to FileProcessor.

The allowlist only checks the extension. A name like ../../tmp/payload.txt still passes the validation and reaches backend/Generator/main.py:371 where os.path.join(self.upload_folder, file.filename) saves the file using the raw, unsanitized filename. This leaves the upload path traversal vulnerability open.

Suggested fix

+from werkzeug.utils import secure_filename + `@app.route`('/upload', methods=['POST']) def upload_file(): if 'file' not in request.files: return jsonify({"error": "No file part"}), 400 file = request.files['file'] if file.filename == '': return jsonify({"error": "No selected file"}), 400 - if not allowed_file(file.filename): + safe_name = secure_filename(file.filename) + if not safe_name or not allowed_file(safe_name): return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400 + file.filename = safe_name content = file_processor.process_file(file)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# 3. Reject invalid files EARLY

if not allowed_file(file.filename):

return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400

from werkzeug.utils import secure_filename

# (other code in the function...)

# 3. Reject invalid files EARLY

safe_name = secure_filename(file.filename)

if not safe_name or not allowed_file(safe_name):

return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400

file.filename = safe_name

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 516 - 519, The filename is not sanitized before being passed to FileProcessor, allowing path traversal via file.filename; sanitize the name (e.g., use werkzeug.utils.secure_filename or at minimum strip path separators and enforce os.path.basename and reject empty results) immediately after the allowed_file check and before calling FileProcessor, and also update FileProcessor where it constructs the path with upload_folder to always use the sanitized filename (and reject filenames that are absolute or contain ..); reference allowed_file, file.filename, FileProcessor, and upload_folder and ensure both the upload entrypoint and the save logic use the same sanitized value.

coderabbitai · 2026-04-14T07:01:23Z

+    try:
+        subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
+                    "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
+                   check=True, capture_output=True, text=True)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify that raw video_id is interpolated into filesystem paths/globs. sed -n '558,592p' backend/server.py echo rg -n -C2 'video_id|subtitles/\{video_id\}|glob\.glob' backend/server.py

Repository: AOSSIE-Org/EduAid

Length of output: 2738

Validate video_id before using it in paths or glob patterns.

video_id from request.args.get('videoId') is used directly in the yt-dlp output path (line 569) and in the cleanup glob pattern (line 589) without validation. Characters like ../ or glob metacharacters (*, ?, [) can escape the intended directory or match unrelated files.

Suggested fix

`@app.route`('/getTranscript', methods=['GET']) def get_transcript(): video_id = request.args.get('videoId') if not video_id: return jsonify({"error": "No video ID provided"}), 400 + if not re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id): + return jsonify({"error": "Invalid video ID"}), 400 transcript_text = None latest_subtitle = None

🧰 Tools

🪛 Ruff (0.15.9)

[error] 568-568: subprocess call: check for execution of untrusted input

(S603)

[error] 568-569: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 567 - 570, Validate and sanitize the video_id before use: ensure the value from request.args.get('videoId') matches a strict allowlist (e.g., regex like /^[A-Za-z0-9_-]{11}$/ or your permitted pattern) and return a 400 error if it fails; reject any value containing path separators, dots sequences, or glob metacharacters. Use the validated/sanitized identifier when constructing the output path for the subprocess.run call (reference variable video_id and the subprocess.run invocation) by joining with a safe directory (e.g., using pathlib.Path("subtitles") / safe_name) and avoid constructing glob patterns with the raw input—either build the exact filename from the sanitized id for cleanup or sanitize before passing to glob. Ensure all places that reference the original video_id (the subprocess output -o argument and the cleanup glob usage) use the validated safe_name instead of the raw input.

coderabbitai · 2026-04-14T07:01:23Z

+        # Find the latest .vtt file in the "subtitles" folder
+        subtitle_files = glob.glob("subtitles/*.vtt")
+        if not subtitle_files:
+            return jsonify({"error": "No subtitles found"}), 404
+
+        latest_subtitle = max(subtitle_files, key=os.path.getctime)
+        transcript_text = clean_transcript(latest_subtitle)


⚠️ Potential issue | 🟠 Major

Stop picking the “latest” subtitle from a shared directory.

This is still race-prone. Two concurrent requests can both write into subtitles/, then whichever file has the newest ctime gets read and deleted by both handlers. That can return the wrong transcript or break the other request. Use a request-scoped temp directory (or an exact output path you already know) instead of scanning subtitles/*.vtt.

Suggested direction

+import shutil +import uuid + `@app.route`('/getTranscript', methods=['GET']) def get_transcript(): video_id = request.args.get('videoId') if not video_id: return jsonify({"error": "No video ID provided"}), 400 transcript_text = None latest_subtitle = None + request_dir = os.path.join("subtitles", uuid.uuid4().hex) + os.makedirs(request_dir, exist_ok=True) try: + output_template = os.path.join(request_dir, "%(id)s.%(ext)s") subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download", - "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"], + "--sub-format", "vtt", "-o", output_template, f"https://www.youtube.com/watch?v={video_id}"], check=True, capture_output=True, text=True) - subtitle_files = glob.glob("subtitles/*.vtt") + subtitle_files = glob.glob(os.path.join(request_dir, "*.vtt")) if not subtitle_files: return jsonify({"error": "No subtitles found"}), 404 latest_subtitle = max(subtitle_files, key=os.path.getctime) transcript_text = clean_transcript(latest_subtitle) return jsonify({"transcript": transcript_text}) finally: - if latest_subtitle and os.path.exists(latest_subtitle): - app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}") - os.remove(latest_subtitle) - elif not latest_subtitle: - for stray in glob.glob(f"subtitles/{video_id}*.vtt"): - if os.path.exists(stray): - app.logger.info(f"Deleting stray temporary subtitle file: {stray}") - os.remove(stray) + shutil.rmtree(request_dir, ignore_errors=True)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 572 - 578, The code currently finds the newest file via subtitle_files = glob.glob("subtitles/*.vtt") and latest_subtitle = max(...), which is race-prone; instead make the subtitle output path request-scoped (e.g., create a unique temp dir or unique filename per request) and read that exact path rather than scanning the shared "subtitles" folder. Update the code that generates the .vtt to accept/return a deterministic path (or use tempfile.mkdtemp()/NamedTemporaryFile) and then call clean_transcript() with that specific path (replace the glob/max logic around subtitle_files/latest_subtitle), and ensure cleanup happens only for that request's file to avoid cross-request deletion.

coderabbitai · 2026-04-14T07:01:23Z

+The `yt-dlp` injection fix runs the same week. Right now `video_id` comes from user input and goes into a subprocess call with `shell=True`. I sanitize the parameter and switch to argument list mode so shell metacharacters are never interpreted:
+
+```python
+import re
+import subprocess
+
+if not re.fullmatch(r'[A-Za-z0-9_-]{6,16}', video_id):
+    raise ValueError("Invalid video ID")
+subprocess.run(["yt-dlp", video_id, "-o", "output.mp4"])
+```
+
+Test first here too. I pass `; rm -rf /tmp/test` as a video ID and confirm it gets blocked before touching the subprocess call.
+
+File upload validation also goes in this week: a whitelist of PDF and DOCX only, size limits at both the application layer and Nginx, and payload checks that reject oversized requests before Flask ever sees them.


⚠️ Potential issue | 🟡 Minor

Align this section with the code shipped in this PR.

This paragraph describes the current backend as using shell=True and says the upload whitelist will be “PDF and DOCX only,” but backend/server.py already uses list-mode subprocess.run(...) and the implemented allowlist is txt, pdf, docx. Updating the proposal text will keep reviewers from reading two different baselines in the same PR.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@GSOC_PROPOSAL_DRAFT.md` around lines 57 - 70, The proposal text is out of sync with the shipped code: update the paragraph so it reflects that the backend already uses argument-list mode (subprocess.run([...]) with the video_id variable passed as an element) instead of shell=True, and that the implemented upload allowlist contains txt, pdf, and docx (not just PDF and DOCX); mention the validation and size limits as implemented (application-layer checks plus upstream Nginx limits) and remove the example implying shell=True or a different allowlist.

SAI-HARISH2007 added 2 commits April 13, 2026 23:17

fix: ensure subtitle file cleanup on failure

e589f97

feat: add file upload validation and size limits

66a5dc7

coderabbitai Bot reviewed Apr 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Backend] Add file upload validation and size limits#665

[Backend] Add file upload validation and size limits#665
SAI-HARISH2007 wants to merge 2 commits into
AOSSIE-Org:mainfrom
SAI-HARISH2007:add-upload-validation

SAI-HARISH2007 commented Apr 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Uh oh!

coderabbitai Bot Apr 14, 2026

Uh oh!

coderabbitai Bot Apr 14, 2026

Uh oh!

coderabbitai Bot Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

SAI-HARISH2007 commented Apr 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Summary

Changes

Impact

AI Usage Disclosure:

Checklist

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SAI-HARISH2007 commented Apr 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 14, 2026 •

edited

Loading