Skip to content

Claude/deploy to railway jn j ez#6

Open
daler91 wants to merge 114 commits into
rdale-dev:masterfrom
daler91:claude/deploy-to-railway-JnJEz
Open

Claude/deploy to railway jn j ez#6
daler91 wants to merge 114 commits into
rdale-dev:masterfrom
daler91:claude/deploy-to-railway-JnJEz

Conversation

@daler91
Copy link
Copy Markdown

@daler91 daler91 commented Mar 30, 2026

No description provided.

google-labs-jules Bot and others added 30 commits March 19, 2026 16:04
Added `pytest` to `requirements.txt` to include test dependencies and allow running `pytest tests/` successfully.

Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🎯 What: Replaced lxml.etree.parse with defusedxml.lxml.parse in xml-validator.py to prevent XML External Entity (XXE) vulnerabilities. Added defusedxml and lxml to requirements.txt.
⚠️ Risk: If left unfixed, the application could be vulnerable to XXE attacks when parsing malicious XML or XSD files, potentially leading to unauthorized data disclosure or denial of service.
🛡️ Solution: defusedxml acts as a drop-in replacement that strictly disables external entity resolution by default, successfully mitigating the XXE attack vector while maintaining compatibility with lxml.etree.XMLSchema and validation.

Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…2842002

🔒 Fix XXE vulnerability in xml-validator
…matting-2048886441696389176

🧪 Add Error Path Tests for Date Formatting in src/data_cleaning.py
…99421084779733224

🧪 Add unit tests for Data Validation module
…2181123944

🧪 Add tests for BaseConverter to verify ABC behavior
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…13268053583850

🧪 Add test for untestable exception block in xml-validator.py
…41166032

chore: Add pytest to requirements.txt
🎯 What:
Fixed an XML External Entity (XXE) vulnerability in `src/xml-validator.py` caused by using `lxml.etree.parse` without disabling external entity resolution.

⚠️ Risk:
When parsing user-provided XML files, `lxml` default configuration resolves external entities. This allows attackers to define malicious external entities (e.g., local files via `file://`) that get included in the parsed XML, leading to arbitrary file disclosure (Local File Inclusion), Server-Side Request Forgery (SSRF), or Denial of Service (Billion Laughs attack).

🛡️ Solution:
Created a secure `etree.XMLParser` with `resolve_entities=False` and passed it to the `etree.parse` calls for both the XML document and the XSD schema document. This prevents the parser from resolving external entities, neutralizing the XXE threat.

Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…ers-1942838300665811679

Refactor Address and Phone logic in CounselingConverter
…cs-4508032247991148993

🧪 Add test for _calculate_demographics in TrainingConverter
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…s-12243694426949791705

🧪 Add comprehensive tests for clean_percentage
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…-4221162114748354920

🧪 Add tests for clean_numeric in data_cleaning.py
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🔒 Fix XXE vulnerability in xml-validator.py
…930982119168753

🧹 [code health improvement] Remove unused sys import in xml-validator.py
google-labs-jules Bot and others added 22 commits March 20, 2026 01:08
Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…ss-directory-15946046135021116189

🧪 [testing improvement] Add tests for process_directory in xml_validator
…1121

🧪 [Add comprehensive tests for fix_sba_xml.py]
…05440491561554403

🧪 [testing improvement] Add tests for main module entry point
Add unit tests for `generate_html_report` in `ValidationTracker` to verify accurate report generation and output directory management.

Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
…745181920218

🧪 [add test for validation_report.py]
Add 4 cross-field validation checks that match SBA upload validation:
1. ReportableImpact cannot be Yes unless VerifiedToBeInBusiness is Yes
   (also reads VerifiedToBeInBusiness from CSV instead of hardcoding)
2. Other Counseling Provided text required when Code is 'Other'
   (adds <Other> sub-element and validation)
3. Internet field mandatory when media code is 'Internet'
4. CounselingSeeking required under Part 2 when client is in business

These validations catch errors before XML upload, preventing rejections
from the SBA end system.

https://claude.ai/code/session_013GwvhAVAhXNjQMHowKMAzW
…s instead

When ReportableImpact is 'Yes' but VerifiedToBeInBusiness isn't 'Yes',
auto-correct VerifiedToBeInBusiness to 'Yes' and keep ReportableImpact
as 'Yes', rather than downgrading ReportableImpact to 'No'.

https://claude.ai/code/session_013GwvhAVAhXNjQMHowKMAzW
Add SBA end-system validation rules to counseling converter
…ning.py

🎯 What: Added a TestCleanPhoneNumber class to test_data_cleaning.py to cover the clean_phone_number function.
📊 Coverage: Tested standard phone formatting characters, alphabetic characters, extensions, empty strings, None, "nan" variations, and numeric input.
✨ Result: clean_phone_number is thoroughly tested improving coverage.

Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🎯 What: The testing gap addressed was a missing error path test for `fix_client_intake_element_order` in `src/xml_validator.py`.
📊 Coverage: Added the error path test by mocking `xml_validator.ET.parse` to raise an Exception and verifying that `False` is returned and `xml_validator.logger.error` is called. Also added a happy path test that ensures the element orders are sorted properly.
✨ Result: `fix_client_intake_element_order` is fully tested with proper happy and error path scenarios covered, ensuring robustness during refactoring.

Co-authored-by: rdale-dev <203160809+rdale-dev@users.noreply.github.com>
🧪 [testing improvement] Add tests for clean_phone_number in data_cleaning.py
…der-1354954169694588663

🧪 [testing improvement] Fix missing error path test for `fix_client_intake_element_order`
Two-service architecture:
- apps/web: Next.js frontend with NextAuth, Prisma, Tailwind
  - Auth (login/signup with bcrypt)
  - Dashboard, upload, preview, column mapping, conversion,
    results, progress, and audit trail pages
  - All API routes for file upload, job management, downloads
- apps/worker: FastAPI backend wrapping existing converters
  - /preview, /convert, /validate-xsd, /health endpoints
  - Reuses all existing conversion logic unchanged

Infrastructure:
- Prisma schema (User, Job, AuditEntry models)
- Docker + docker-compose for local dev
- XSD schemas moved to schemas/ directory

Existing code changes (minimal):
- src/xml_validator.py: Fix bare import for package compatibility
- src/validation_report.py: Add to_dict() helper method
- Added __init__.py to src/ and src/converters/

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
- Add validate_path() helper that resolves paths and ensures they stay
  within the allowed DATA_DIR, preventing path traversal attacks
- Apply path validation to all file operations in convert, preview,
  and validate routes/services
- Replace str(e) in exception handlers with generic error messages
  to prevent stack trace information exposure
- Log exceptions server-side for debugging while returning safe messages

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
- Add safe_output_path() that constructs output paths from sanitized
  job IDs instead of deriving from user-provided csv_path
- Sanitize job_id to alphanumeric/hyphens/underscores only
- Remove str(e) from ValueError handler in convert route
- Add os.path.realpath() in xml_validator.validate_against_xsd to
  resolve paths before passing to etree.parse

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Resolve output_dir with os.path.realpath and verify it stays within
DATA_DIR before calling os.makedirs, satisfying CodeQL's taint
tracking for uncontrolled path expressions.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Instead of accepting csv_path/xml_file_path in requests (which CodeQL
flags as tainted), all endpoints now accept job_id + file_name and
construct paths server-side via get_upload_path() and get_output_path().

These helpers sanitize identifiers with allowlists (alphanumeric,
hyphens, underscores) and build paths from the trusted DATA_DIR,
making path traversal impossible by construction.

- Remove csv_path/xml_file_path from all Pydantic request models
- Routes call security helpers to build paths, then pass to services
- Services no longer import or call validate_path
- Update Next.js callers to send file_name instead of inputFilePath

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
…alidate_against_xsd

- get_output_path: resolve with os.path.realpath and verify prefix
  against DATA_DIR before os.makedirs
- validate_against_xsd: return dict instead of tuple, log exceptions
  server-side instead of exposing str(e) to callers
- Update all callers (xml_validator.py, run.py, tests) to use dict
  access instead of tuple unpacking

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
CodeQL requires the sanitizer pattern (os.path.realpath + startswith
guard) in the same function scope as the file operation. Helper
functions don't propagate sanitizer status across call boundaries.

Changes:
- security.py: Simplified to just sanitize_id/sanitize_filename
- Routes: Construct paths with os.path.join, validate inline with
  os.path.realpath + startswith(DATA_DIR) before passing to services
- Services: Re-validate paths with same realpath+startswith pattern
  before pd.read_csv, open(), converter.convert(), etc.
- xml_validator.py: Add realpath+startswith(DATA_DIR) guard before
  etree.parse calls

Every file operation now has a direct realpath+startswith guard in
the same function, which is the pattern CodeQL recognizes.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Railway was detecting the root requirements.txt and trying to build
as a single Python project. Add per-service railway.toml files that
tell Railway to use Dockerfiles instead of auto-detection.

- apps/web/railway.toml: Next.js service via Dockerfile
- apps/worker/railway.toml: FastAPI service via Dockerfile

Each service should be configured in Railway dashboard to point at
its subdirectory (web -> apps/web, worker -> repo root).

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
- Switch from npm ci to npm install (npm ci fails when Railway's
  build context doesn't include package-lock.json properly)
- Add prisma schema and client to final stage
- Auto-run prisma db push on container startup to apply schema

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions the SBA data conversion tool into a full-stack application by adding a Next.js web interface and a FastAPI worker service. The update includes a Dockerized environment, database-backed job tracking, and enhanced conversion logic. Feedback focuses on improving system robustness and security, specifically by adopting an asynchronous processing model for conversions, implementing email normalization for authentication, and sanitizing file uploads to prevent path traversal. Further suggestions include using stream-based file handling to optimize memory usage and improving error logging across the API routes.

Comment on lines +39 to +47
const result = await workerFetch<ConvertResponse>("/convert", {
method: "POST",
body: JSON.stringify({
job_id: jobId,
file_name: job.inputFileName,
converter_type: job.converterType,
column_mapping: job.columnMapping,
}),
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This route handler performs a blocking HTTP call to the worker service. Depending on the size of the CSV, the conversion process could exceed the default timeout of the server or the client's connection. Consider making this process asynchronous by returning a 'processing' status immediately and having the worker update the job status via a webhook or shared database state.

);
}

const existing = await prisma.user.findUnique({ where: { email } });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Email addresses should be normalized (e.g., lowercased) before performing a unique check and before saving to the database to prevent duplicate accounts with different casing (e.g., 'User@Example.com' and 'user@example.com').

    const normalizedEmail = email.toLowerCase();
    const existing = await prisma.user.findUnique({ where: { email: normalizedEmail } });

Comment on lines +40 to +45
} catch {
return NextResponse.json(
{ error: "Internal server error" },
{ status: 500 }
);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error is caught but not logged. Swallowing errors without logging makes it difficult to diagnose issues in production. Consider logging the error object before returning the 500 response.

  } catch (error) {
    console.error("Signup error:", error);
    return NextResponse.json(
      { error: "Internal server error" },
      { status: 500 }
    );
  }

}

const fileBuffer = await readFile(job.outputFilePath);
const fileName = job.inputFileName.replace(".csv", ".xml");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using .replace(".csv", ".xml") will only replace the first occurrence of ".csv". If a filename contains multiple occurrences (e.g., data.csv.backup.csv), the result might be incorrect. Use a regular expression with the end-of-string anchor to ensure only the extension is replaced.

Suggested change
const fileName = job.inputFileName.replace(".csv", ".xml");
const fileName = job.inputFileName.replace(/\.csv$/i, ".xml");

// Save file to volume
const uploadDir = path.join(DATA_DIR, "uploads", job.id);
await mkdir(uploadDir, { recursive: true });
const filePath = path.join(uploadDir, file.name);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The file.name is used directly in path.join. While it is prefixed by a unique job.id directory, it is still best practice to sanitize the filename to remove any potentially malicious characters or sequences that could interfere with the filesystem.

Suggested change
const filePath = path.join(uploadDir, file.name);
const sanitizedFileName = path.basename(file.name).replace(/[^a-zA-Z0-9._-]/g, '_');
const filePath = path.join(uploadDir, sanitizedFileName);

Comment on lines +51 to +52
const bytes = await file.arrayBuffer();
await writeFile(filePath, Buffer.from(bytes));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Reading the entire file into memory using file.arrayBuffer() can lead to high memory consumption for large files (e.g., the 50MB limit mentioned in the UI). Consider using a stream-based approach to write the file directly to disk.

Comment thread apps/web/src/lib/auth.ts
if (!credentials?.email || !credentials?.password) return null;

const user = await prisma.user.findUnique({
where: { email: credentials.email as string },
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Email normalization should be applied here as well to ensure users can log in regardless of the casing used during registration or login.

        const user = await prisma.user.findUnique({
          where: { email: (credentials.email as string).toLowerCase() },
        });

Comment thread src/data_cleaning.py
Comment on lines 351 to +352
except (ValueError, TypeError):
raise ValueError(f"Invalid percentage value: {value}")
return "0"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Returning a default value of "0" on a ValueError or TypeError silently ignores malformed data. It is better to log a warning or return None so the caller can handle the validation failure appropriately.

Suggested change
except (ValueError, TypeError):
raise ValueError(f"Invalid percentage value: {value}")
return "0"
except (ValueError, TypeError):
return "0" # Consider logging this as a validation warning

claude added 7 commits March 30, 2026 22:35
Railway builds from the repo root, not apps/web/. Update COPY paths
to use apps/web/ prefix so package.json and source files are found.
Also update railway.toml to point to apps/web/Dockerfile.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
The postinstall script runs prisma generate but the schema file isn't
available in the deps stage. Use --ignore-scripts to skip it; prisma
generate runs explicitly in the builder stage after COPY.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
The final Docker stage copies /app/public from the builder but the
directory didn't exist, causing COPY to fail.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
npx was downloading Prisma 7 which removed --skip-generate flag,
causing the startup command to fail and crash-restart infinitely.

Fix: copy the prisma CLI from the builder stage (which has v6) and
invoke it directly. Also add || fallback so a DB connection failure
doesn't prevent the app from starting.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
The prisma CLI requires the 'effect' package and other transitive
deps. Copy full node_modules from builder into a separate directory
(migrate_modules) so prisma db push has everything it needs.

Also: remind user to set NEXTAUTH_URL with https:// prefix.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
1. Replace prisma CLI with raw SQL migration script (scripts/migrate.js)
   - Uses @prisma/client which is already bundled in the final image
   - Creates tables with IF NOT EXISTS, idempotent foreign keys
   - No extra dependencies needed

2. Fix UntrustedHost error by adding trustHost: true to NextAuth config
   - Required when running behind Railway's reverse proxy

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Prisma $executeRawUnsafe doesn't support multiple statements in one
call. Split into individual statements executed sequentially.

https://claude.ai/code/session_01BjsLNuDZXU8417rtp7sCjZ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants