Claude/deploy to railway jn j ez#6
Conversation
Added `pytest` to `requirements.txt` to include test dependencies and allow running `pytest tests/` successfully. Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🎯 What: Replaced lxml.etree.parse with defusedxml.lxml.parse in xml-validator.py to prevent XML External Entity (XXE) vulnerabilities. Added defusedxml and lxml to requirements.txt.⚠️ Risk: If left unfixed, the application could be vulnerable to XXE attacks when parsing malicious XML or XSD files, potentially leading to unauthorized data disclosure or denial of service. 🛡️ Solution: defusedxml acts as a drop-in replacement that strictly disables external entity resolution by default, successfully mitigating the XXE attack vector while maintaining compatibility with lxml.etree.XMLSchema and validation. Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…2842002 🔒 Fix XXE vulnerability in xml-validator
…matting-2048886441696389176 🧪 Add Error Path Tests for Date Formatting in src/data_cleaning.py
…99421084779733224 🧪 Add unit tests for Data Validation module
…2181123944 🧪 Add tests for BaseConverter to verify ABC behavior
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…13268053583850 🧪 Add test for untestable exception block in xml-validator.py
…41166032 chore: Add pytest to requirements.txt
🎯 What: Fixed an XML External Entity (XXE) vulnerability in `src/xml-validator.py` caused by using `lxml.etree.parse` without disabling external entity resolution.⚠️ Risk: When parsing user-provided XML files, `lxml` default configuration resolves external entities. This allows attackers to define malicious external entities (e.g., local files via `file://`) that get included in the parsed XML, leading to arbitrary file disclosure (Local File Inclusion), Server-Side Request Forgery (SSRF), or Denial of Service (Billion Laughs attack). 🛡️ Solution: Created a secure `etree.XMLParser` with `resolve_entities=False` and passed it to the `etree.parse` calls for both the XML document and the XSD schema document. This prevents the parser from resolving external entities, neutralizing the XXE threat. Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…ers-1942838300665811679 Refactor Address and Phone logic in CounselingConverter
…cs-4508032247991148993 🧪 Add test for _calculate_demographics in TrainingConverter
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…s-12243694426949791705 🧪 Add comprehensive tests for clean_percentage
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…-4221162114748354920 🧪 Add tests for clean_numeric in data_cleaning.py
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🔒 Fix XXE vulnerability in xml-validator.py
…930982119168753 🧹 [code health improvement] Remove unused sys import in xml-validator.py
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…ss-directory-15946046135021116189 🧪 [testing improvement] Add tests for process_directory in xml_validator
…1121 🧪 [Add comprehensive tests for fix_sba_xml.py]
…05440491561554403 🧪 [testing improvement] Add tests for main module entry point
Add unit tests for `generate_html_report` in `ValidationTracker` to verify accurate report generation and output directory management. Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…745181920218 🧪 [add test for validation_report.py]
Add 4 cross-field validation checks that match SBA upload validation: 1. ReportableImpact cannot be Yes unless VerifiedToBeInBusiness is Yes (also reads VerifiedToBeInBusiness from CSV instead of hardcoding) 2. Other Counseling Provided text required when Code is 'Other' (adds <Other> sub-element and validation) 3. Internet field mandatory when media code is 'Internet' 4. CounselingSeeking required under Part 2 when client is in business These validations catch errors before XML upload, preventing rejections from the SBA end system. https://claude.ai/code/session_013GwvhAVAhXNjQMHowKMAzW
…s instead When ReportableImpact is 'Yes' but VerifiedToBeInBusiness isn't 'Yes', auto-correct VerifiedToBeInBusiness to 'Yes' and keep ReportableImpact as 'Yes', rather than downgrading ReportableImpact to 'No'. https://claude.ai/code/session_013GwvhAVAhXNjQMHowKMAzW
Add SBA end-system validation rules to counseling converter
…ning.py 🎯 What: Added a TestCleanPhoneNumber class to test_data_cleaning.py to cover the clean_phone_number function. 📊 Coverage: Tested standard phone formatting characters, alphabetic characters, extensions, empty strings, None, "nan" variations, and numeric input. ✨ Result: clean_phone_number is thoroughly tested improving coverage. Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🎯 What: The testing gap addressed was a missing error path test for `fix_client_intake_element_order` in `src/xml_validator.py`. 📊 Coverage: Added the error path test by mocking `xml_validator.ET.parse` to raise an Exception and verifying that `False` is returned and `xml_validator.logger.error` is called. Also added a happy path test that ensures the element orders are sorted properly. ✨ Result: `fix_client_intake_element_order` is fully tested with proper happy and error path scenarios covered, ensuring robustness during refactoring. Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🧪 [testing improvement] Add tests for clean_phone_number in data_cleaning.py
…der-1354954169694588663 🧪 [testing improvement] Fix missing error path test for `fix_client_intake_element_order`
Two-service architecture:
- apps/web: Next.js frontend with NextAuth, Prisma, Tailwind
- Auth (login/signup with bcrypt)
- Dashboard, upload, preview, column mapping, conversion,
results, progress, and audit trail pages
- All API routes for file upload, job management, downloads
- apps/worker: FastAPI backend wrapping existing converters
- /preview, /convert, /validate-xsd, /health endpoints
- Reuses all existing conversion logic unchanged
Infrastructure:
- Prisma schema (User, Job, AuditEntry models)
- Docker + docker-compose for local dev
- XSD schemas moved to schemas/ directory
Existing code changes (minimal):
- src/xml_validator.py: Fix bare import for package compatibility
- src/validation_report.py: Add to_dict() helper method
- Added __init__.py to src/ and src/converters/
https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
- Add validate_path() helper that resolves paths and ensures they stay within the allowed DATA_DIR, preventing path traversal attacks - Apply path validation to all file operations in convert, preview, and validate routes/services - Replace str(e) in exception handlers with generic error messages to prevent stack trace information exposure - Log exceptions server-side for debugging while returning safe messages https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
- Add safe_output_path() that constructs output paths from sanitized job IDs instead of deriving from user-provided csv_path - Sanitize job_id to alphanumeric/hyphens/underscores only - Remove str(e) from ValueError handler in convert route - Add os.path.realpath() in xml_validator.validate_against_xsd to resolve paths before passing to etree.parse https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Resolve output_dir with os.path.realpath and verify it stays within DATA_DIR before calling os.makedirs, satisfying CodeQL's taint tracking for uncontrolled path expressions. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Instead of accepting csv_path/xml_file_path in requests (which CodeQL flags as tainted), all endpoints now accept job_id + file_name and construct paths server-side via get_upload_path() and get_output_path(). These helpers sanitize identifiers with allowlists (alphanumeric, hyphens, underscores) and build paths from the trusted DATA_DIR, making path traversal impossible by construction. - Remove csv_path/xml_file_path from all Pydantic request models - Routes call security helpers to build paths, then pass to services - Services no longer import or call validate_path - Update Next.js callers to send file_name instead of inputFilePath https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
…alidate_against_xsd - get_output_path: resolve with os.path.realpath and verify prefix against DATA_DIR before os.makedirs - validate_against_xsd: return dict instead of tuple, log exceptions server-side instead of exposing str(e) to callers - Update all callers (xml_validator.py, run.py, tests) to use dict access instead of tuple unpacking https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
CodeQL requires the sanitizer pattern (os.path.realpath + startswith guard) in the same function scope as the file operation. Helper functions don't propagate sanitizer status across call boundaries. Changes: - security.py: Simplified to just sanitize_id/sanitize_filename - Routes: Construct paths with os.path.join, validate inline with os.path.realpath + startswith(DATA_DIR) before passing to services - Services: Re-validate paths with same realpath+startswith pattern before pd.read_csv, open(), converter.convert(), etc. - xml_validator.py: Add realpath+startswith(DATA_DIR) guard before etree.parse calls Every file operation now has a direct realpath+startswith guard in the same function, which is the pattern CodeQL recognizes. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Railway was detecting the root requirements.txt and trying to build as a single Python project. Add per-service railway.toml files that tell Railway to use Dockerfiles instead of auto-detection. - apps/web/railway.toml: Next.js service via Dockerfile - apps/worker/railway.toml: FastAPI service via Dockerfile Each service should be configured in Railway dashboard to point at its subdirectory (web -> apps/web, worker -> repo root). https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
- Switch from npm ci to npm install (npm ci fails when Railway's build context doesn't include package-lock.json properly) - Add prisma schema and client to final stage - Auto-run prisma db push on container startup to apply schema https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
There was a problem hiding this comment.
Code Review
This pull request transitions the SBA data conversion tool into a full-stack application by adding a Next.js web interface and a FastAPI worker service. The update includes a Dockerized environment, database-backed job tracking, and enhanced conversion logic. Feedback focuses on improving system robustness and security, specifically by adopting an asynchronous processing model for conversions, implementing email normalization for authentication, and sanitizing file uploads to prevent path traversal. Further suggestions include using stream-based file handling to optimize memory usage and improving error logging across the API routes.
| const result = await workerFetch<ConvertResponse>("/convert", { | ||
| method: "POST", | ||
| body: JSON.stringify({ | ||
| job_id: jobId, | ||
| file_name: job.inputFileName, | ||
| converter_type: job.converterType, | ||
| column_mapping: job.columnMapping, | ||
| }), | ||
| }); |
There was a problem hiding this comment.
This route handler performs a blocking HTTP call to the worker service. Depending on the size of the CSV, the conversion process could exceed the default timeout of the server or the client's connection. Consider making this process asynchronous by returning a 'processing' status immediately and having the worker update the job status via a webhook or shared database state.
| ); | ||
| } | ||
|
|
||
| const existing = await prisma.user.findUnique({ where: { email } }); |
There was a problem hiding this comment.
Email addresses should be normalized (e.g., lowercased) before performing a unique check and before saving to the database to prevent duplicate accounts with different casing (e.g., 'User@Example.com' and 'user@example.com').
const normalizedEmail = email.toLowerCase();
const existing = await prisma.user.findUnique({ where: { email: normalizedEmail } });| } catch { | ||
| return NextResponse.json( | ||
| { error: "Internal server error" }, | ||
| { status: 500 } | ||
| ); | ||
| } |
There was a problem hiding this comment.
The error is caught but not logged. Swallowing errors without logging makes it difficult to diagnose issues in production. Consider logging the error object before returning the 500 response.
} catch (error) {
console.error("Signup error:", error);
return NextResponse.json(
{ error: "Internal server error" },
{ status: 500 }
);
}| } | ||
|
|
||
| const fileBuffer = await readFile(job.outputFilePath); | ||
| const fileName = job.inputFileName.replace(".csv", ".xml"); |
There was a problem hiding this comment.
Using .replace(".csv", ".xml") will only replace the first occurrence of ".csv". If a filename contains multiple occurrences (e.g., data.csv.backup.csv), the result might be incorrect. Use a regular expression with the end-of-string anchor to ensure only the extension is replaced.
| const fileName = job.inputFileName.replace(".csv", ".xml"); | |
| const fileName = job.inputFileName.replace(/\.csv$/i, ".xml"); |
| // Save file to volume | ||
| const uploadDir = path.join(DATA_DIR, "uploads", job.id); | ||
| await mkdir(uploadDir, { recursive: true }); | ||
| const filePath = path.join(uploadDir, file.name); |
There was a problem hiding this comment.
The file.name is used directly in path.join. While it is prefixed by a unique job.id directory, it is still best practice to sanitize the filename to remove any potentially malicious characters or sequences that could interfere with the filesystem.
| const filePath = path.join(uploadDir, file.name); | |
| const sanitizedFileName = path.basename(file.name).replace(/[^a-zA-Z0-9._-]/g, '_'); | |
| const filePath = path.join(uploadDir, sanitizedFileName); |
| const bytes = await file.arrayBuffer(); | ||
| await writeFile(filePath, Buffer.from(bytes)); |
| if (!credentials?.email || !credentials?.password) return null; | ||
|
|
||
| const user = await prisma.user.findUnique({ | ||
| where: { email: credentials.email as string }, |
| except (ValueError, TypeError): | ||
| raise ValueError(f"Invalid percentage value: {value}") | ||
| return "0" |
There was a problem hiding this comment.
Returning a default value of "0" on a ValueError or TypeError silently ignores malformed data. It is better to log a warning or return None so the caller can handle the validation failure appropriately.
| except (ValueError, TypeError): | |
| raise ValueError(f"Invalid percentage value: {value}") | |
| return "0" | |
| except (ValueError, TypeError): | |
| return "0" # Consider logging this as a validation warning |
Railway builds from the repo root, not apps/web/. Update COPY paths to use apps/web/ prefix so package.json and source files are found. Also update railway.toml to point to apps/web/Dockerfile. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
The postinstall script runs prisma generate but the schema file isn't available in the deps stage. Use --ignore-scripts to skip it; prisma generate runs explicitly in the builder stage after COPY. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
The final Docker stage copies /app/public from the builder but the directory didn't exist, causing COPY to fail. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
npx was downloading Prisma 7 which removed --skip-generate flag, causing the startup command to fail and crash-restart infinitely. Fix: copy the prisma CLI from the builder stage (which has v6) and invoke it directly. Also add || fallback so a DB connection failure doesn't prevent the app from starting. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
The prisma CLI requires the 'effect' package and other transitive deps. Copy full node_modules from builder into a separate directory (migrate_modules) so prisma db push has everything it needs. Also: remind user to set NEXTAUTH_URL with https:// prefix. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
1. Replace prisma CLI with raw SQL migration script (scripts/migrate.js) - Uses @prisma/client which is already bundled in the final image - Creates tables with IF NOT EXISTS, idempotent foreign keys - No extra dependencies needed 2. Fix UntrustedHost error by adding trustHost: true to NextAuth config - Required when running behind Railway's reverse proxy https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Prisma $executeRawUnsafe doesn't support multiple statements in one call. Split into individual statements executed sequentially. https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
No description provided.