[Backend] Add file upload validation and size limits#665
[Backend] Add file upload validation and size limits#665SAI-HARISH2007 wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughNew GSoC 2026 proposal document detailing backend hardening efforts, and corresponding security improvements to the Flask backend including file upload validation with extension whitelisting, request size limiting, and guaranteed temporary file cleanup via try/finally blocks. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 572-578: The code currently finds the newest file via
subtitle_files = glob.glob("subtitles/*.vtt") and latest_subtitle = max(...),
which is race-prone; instead make the subtitle output path request-scoped (e.g.,
create a unique temp dir or unique filename per request) and read that exact
path rather than scanning the shared "subtitles" folder. Update the code that
generates the .vtt to accept/return a deterministic path (or use
tempfile.mkdtemp()/NamedTemporaryFile) and then call clean_transcript() with
that specific path (replace the glob/max logic around
subtitle_files/latest_subtitle), and ensure cleanup happens only for that
request's file to avoid cross-request deletion.
- Around line 516-519: The filename is not sanitized before being passed to
FileProcessor, allowing path traversal via file.filename; sanitize the name
(e.g., use werkzeug.utils.secure_filename or at minimum strip path separators
and enforce os.path.basename and reject empty results) immediately after the
allowed_file check and before calling FileProcessor, and also update
FileProcessor where it constructs the path with upload_folder to always use the
sanitized filename (and reject filenames that are absolute or contain ..);
reference allowed_file, file.filename, FileProcessor, and upload_folder and
ensure both the upload entrypoint and the save logic use the same sanitized
value.
- Around line 567-570: Validate and sanitize the video_id before use: ensure the
value from request.args.get('videoId') matches a strict allowlist (e.g., regex
like /^[A-Za-z0-9_-]{11}$/ or your permitted pattern) and return a 400 error if
it fails; reject any value containing path separators, dots sequences, or glob
metacharacters. Use the validated/sanitized identifier when constructing the
output path for the subprocess.run call (reference variable video_id and the
subprocess.run invocation) by joining with a safe directory (e.g., using
pathlib.Path("subtitles") / safe_name) and avoid constructing glob patterns with
the raw input—either build the exact filename from the sanitized id for cleanup
or sanitize before passing to glob. Ensure all places that reference the
original video_id (the subprocess output -o argument and the cleanup glob usage)
use the validated safe_name instead of the raw input.
In `@GSOC_PROPOSAL_DRAFT.md`:
- Around line 57-70: The proposal text is out of sync with the shipped code:
update the paragraph so it reflects that the backend already uses argument-list
mode (subprocess.run([...]) with the video_id variable passed as an element)
instead of shell=True, and that the implemented upload allowlist contains txt,
pdf, and docx (not just PDF and DOCX); mention the validation and size limits as
implemented (application-layer checks plus upstream Nginx limits) and remove the
example implying shell=True or a different allowlist.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ebef0ddd-1bdb-4a3c-b950-a31b5674953c
📒 Files selected for processing (2)
GSOC_PROPOSAL_DRAFT.mdbackend/server.py
| # 3. Reject invalid files EARLY | ||
| if not allowed_file(file.filename): | ||
| return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400 | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that the raw client filename still reaches the filesystem save path.
sed -n '352,384p' backend/Generator/main.py
echo
rg -n -C2 'process_file|file\.filename|secure_filename|allowed_file' backend/server.py backend/Generator/main.pyRepository: AOSSIE-Org/EduAid
Length of output: 3015
Sanitize the filename before handing it to FileProcessor.
The allowlist only checks the extension. A name like ../../tmp/payload.txt still passes the validation and reaches backend/Generator/main.py:371 where os.path.join(self.upload_folder, file.filename) saves the file using the raw, unsanitized filename. This leaves the upload path traversal vulnerability open.
Suggested fix
+from werkzeug.utils import secure_filename
+
`@app.route`('/upload', methods=['POST'])
def upload_file():
if 'file' not in request.files:
return jsonify({"error": "No file part"}), 400
file = request.files['file']
if file.filename == '':
return jsonify({"error": "No selected file"}), 400
- if not allowed_file(file.filename):
+ safe_name = secure_filename(file.filename)
+ if not safe_name or not allowed_file(safe_name):
return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400
+ file.filename = safe_name
content = file_processor.process_file(file)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # 3. Reject invalid files EARLY | |
| if not allowed_file(file.filename): | |
| return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400 | |
| from werkzeug.utils import secure_filename | |
| # (other code in the function...) | |
| # 3. Reject invalid files EARLY | |
| safe_name = secure_filename(file.filename) | |
| if not safe_name or not allowed_file(safe_name): | |
| return jsonify({"error": "File type not allowed. Supported types: txt, pdf, docx"}), 400 | |
| file.filename = safe_name |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 516 - 519, The filename is not sanitized
before being passed to FileProcessor, allowing path traversal via file.filename;
sanitize the name (e.g., use werkzeug.utils.secure_filename or at minimum strip
path separators and enforce os.path.basename and reject empty results)
immediately after the allowed_file check and before calling FileProcessor, and
also update FileProcessor where it constructs the path with upload_folder to
always use the sanitized filename (and reject filenames that are absolute or
contain ..); reference allowed_file, file.filename, FileProcessor, and
upload_folder and ensure both the upload entrypoint and the save logic use the
same sanitized value.
| try: | ||
| subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download", | ||
| "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"], | ||
| check=True, capture_output=True, text=True) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that raw video_id is interpolated into filesystem paths/globs.
sed -n '558,592p' backend/server.py
echo
rg -n -C2 'video_id|subtitles/\{video_id\}|glob\.glob' backend/server.pyRepository: AOSSIE-Org/EduAid
Length of output: 2738
Validate video_id before using it in paths or glob patterns.
video_id from request.args.get('videoId') is used directly in the yt-dlp output path (line 569) and in the cleanup glob pattern (line 589) without validation. Characters like ../ or glob metacharacters (*, ?, [) can escape the intended directory or match unrelated files.
Suggested fix
`@app.route`('/getTranscript', methods=['GET'])
def get_transcript():
video_id = request.args.get('videoId')
if not video_id:
return jsonify({"error": "No video ID provided"}), 400
+ if not re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id):
+ return jsonify({"error": "Invalid video ID"}), 400
transcript_text = None
latest_subtitle = None🧰 Tools
🪛 Ruff (0.15.9)
[error] 568-568: subprocess call: check for execution of untrusted input
(S603)
[error] 568-569: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 567 - 570, Validate and sanitize the video_id
before use: ensure the value from request.args.get('videoId') matches a strict
allowlist (e.g., regex like /^[A-Za-z0-9_-]{11}$/ or your permitted pattern) and
return a 400 error if it fails; reject any value containing path separators,
dots sequences, or glob metacharacters. Use the validated/sanitized identifier
when constructing the output path for the subprocess.run call (reference
variable video_id and the subprocess.run invocation) by joining with a safe
directory (e.g., using pathlib.Path("subtitles") / safe_name) and avoid
constructing glob patterns with the raw input—either build the exact filename
from the sanitized id for cleanup or sanitize before passing to glob. Ensure all
places that reference the original video_id (the subprocess output -o argument
and the cleanup glob usage) use the validated safe_name instead of the raw
input.
| # Find the latest .vtt file in the "subtitles" folder | ||
| subtitle_files = glob.glob("subtitles/*.vtt") | ||
| if not subtitle_files: | ||
| return jsonify({"error": "No subtitles found"}), 404 | ||
|
|
||
| latest_subtitle = max(subtitle_files, key=os.path.getctime) | ||
| transcript_text = clean_transcript(latest_subtitle) |
There was a problem hiding this comment.
Stop picking the “latest” subtitle from a shared directory.
This is still race-prone. Two concurrent requests can both write into subtitles/, then whichever file has the newest ctime gets read and deleted by both handlers. That can return the wrong transcript or break the other request. Use a request-scoped temp directory (or an exact output path you already know) instead of scanning subtitles/*.vtt.
Suggested direction
+import shutil
+import uuid
+
`@app.route`('/getTranscript', methods=['GET'])
def get_transcript():
video_id = request.args.get('videoId')
if not video_id:
return jsonify({"error": "No video ID provided"}), 400
transcript_text = None
latest_subtitle = None
+ request_dir = os.path.join("subtitles", uuid.uuid4().hex)
+ os.makedirs(request_dir, exist_ok=True)
try:
+ output_template = os.path.join(request_dir, "%(id)s.%(ext)s")
subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
- "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
+ "--sub-format", "vtt", "-o", output_template, f"https://www.youtube.com/watch?v={video_id}"],
check=True, capture_output=True, text=True)
- subtitle_files = glob.glob("subtitles/*.vtt")
+ subtitle_files = glob.glob(os.path.join(request_dir, "*.vtt"))
if not subtitle_files:
return jsonify({"error": "No subtitles found"}), 404
latest_subtitle = max(subtitle_files, key=os.path.getctime)
transcript_text = clean_transcript(latest_subtitle)
return jsonify({"transcript": transcript_text})
finally:
- if latest_subtitle and os.path.exists(latest_subtitle):
- app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
- os.remove(latest_subtitle)
- elif not latest_subtitle:
- for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
- if os.path.exists(stray):
- app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
- os.remove(stray)
+ shutil.rmtree(request_dir, ignore_errors=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 572 - 578, The code currently finds the
newest file via subtitle_files = glob.glob("subtitles/*.vtt") and
latest_subtitle = max(...), which is race-prone; instead make the subtitle
output path request-scoped (e.g., create a unique temp dir or unique filename
per request) and read that exact path rather than scanning the shared
"subtitles" folder. Update the code that generates the .vtt to accept/return a
deterministic path (or use tempfile.mkdtemp()/NamedTemporaryFile) and then call
clean_transcript() with that specific path (replace the glob/max logic around
subtitle_files/latest_subtitle), and ensure cleanup happens only for that
request's file to avoid cross-request deletion.
| The `yt-dlp` injection fix runs the same week. Right now `video_id` comes from user input and goes into a subprocess call with `shell=True`. I sanitize the parameter and switch to argument list mode so shell metacharacters are never interpreted: | ||
|
|
||
| ```python | ||
| import re | ||
| import subprocess | ||
|
|
||
| if not re.fullmatch(r'[A-Za-z0-9_-]{6,16}', video_id): | ||
| raise ValueError("Invalid video ID") | ||
| subprocess.run(["yt-dlp", video_id, "-o", "output.mp4"]) | ||
| ``` | ||
|
|
||
| Test first here too. I pass `; rm -rf /tmp/test` as a video ID and confirm it gets blocked before touching the subprocess call. | ||
|
|
||
| File upload validation also goes in this week: a whitelist of PDF and DOCX only, size limits at both the application layer and Nginx, and payload checks that reject oversized requests before Flask ever sees them. |
There was a problem hiding this comment.
Align this section with the code shipped in this PR.
This paragraph describes the current backend as using shell=True and says the upload whitelist will be “PDF and DOCX only,” but backend/server.py already uses list-mode subprocess.run(...) and the implemented allowlist is txt, pdf, docx. Updating the proposal text will keep reviewers from reading two different baselines in the same PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@GSOC_PROPOSAL_DRAFT.md` around lines 57 - 70, The proposal text is out of
sync with the shipped code: update the paragraph so it reflects that the backend
already uses argument-list mode (subprocess.run([...]) with the video_id
variable passed as an element) instead of shell=True, and that the implemented
upload allowlist contains txt, pdf, and docx (not just PDF and DOCX); mention
the validation and size limits as implemented (application-layer checks plus
upstream Nginx limits) and remove the example implying shell=True or a different
allowlist.
Addressed Issues:
Improves file upload safety and validation
Screenshots/Recordings:
Not applicable (backend validation improvement)
Additional Notes:
Summary
This PR introduces basic input validation for file uploads to improve the safety and stability of the backend.
Previously, the upload flow accepted files without checking their size or type, which could lead to server overload or invalid file processing.
Changes
MAX_CONTENT_LENGTH)txt,pdf, anddocxImpact
This helps prevent:
AI Usage Disclosure:
I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine the implementation and improve edge-case handling.
Checklist
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes