diff --git a/.cursor/skills/release-qa/SKILL.md b/.cursor/skills/release-qa/SKILL.md
new file mode 100644
index 0000000000..6b1528e36f
--- /dev/null
+++ b/.cursor/skills/release-qa/SKILL.md
@@ -0,0 +1,195 @@
+---
+name: release-qa
+description: |
+ Runs pre-release QA for Psoxy: verify release refs (rc-vX.Y.Z → vX.Y.Z), apply AWS and GCP
+ dev examples sequentially, run test-all.sh for both, summarize connector status, create the
+ rc-to-main release PR, and post QA results on that PR. Use when cutting a release, running
+ release QA, merging rc-v to main, or when the user asks to test connectors before publish.
+---
+
+# Release QA
+
+End-to-end release QA for the Psoxy repo on an `rc-vX.Y.Z` branch that has been prepared for release (`./tools/release/prep.sh rc-vX.Y.Z vX.Y.Z`).
+
+## Prerequisites
+
+- On branch `rc-vX.Y.Z` with release refs already updated to `vX.Y.Z`
+- Authenticated: `aws`, `gcloud` (+ ADC), and `az` (if `msft_tenant_id` in tfvars)
+- `gh` CLI authenticated
+- `terraform` available in PATH
+- Repo root as working directory unless noted
+
+Derive `RELEASE` from the branch (`rc-v0.6.6` → `v0.6.6`) or accept it from the user.
+
+## Workflow checklist
+
+```
+Release QA progress:
+- [ ] Step 1: Verify release refs
+- [ ] Step 2: Apply AWS example (review plan log)
+- [ ] Step 3: Apply GCP example (review plan log)
+- [ ] Step 4: Run test-all on AWS
+- [ ] Step 5: Run test-all on GCP
+- [ ] Step 6: Summarize connector results
+- [ ] Step 7: Create release PR (rc-to-main)
+- [ ] Step 8: Post PR comment + check off test plan
+```
+
+Run steps **sequentially**. Do not apply AWS and GCP in parallel.
+
+---
+
+## Step 1: Verify release refs
+
+If refs are not yet updated, run prep first (interactive):
+
+```bash
+./tools/release/prep.sh rc-vX.Y.Z vX.Y.Z
+```
+
+Then verify:
+
+```bash
+./tools/release/qa/verify-release-refs.sh vX.Y.Z
+```
+
+Stop if verification fails. Fix with `prep.sh` or manual ref updates before continuing.
+
+---
+
+## Step 2–3: Apply dev examples (sequential)
+
+Use the non-interactive helper (runs `terraform plan` then `terraform apply`, logs both):
+
+```bash
+./tools/release/qa/apply-example.sh aws vX.Y.Z true
+# Review plan log printed path; confirm apply succeeded before continuing
+
+./tools/release/qa/apply-example.sh gcp vX.Y.Z true
+```
+
+Logs land in `infra/examples-dev/{aws,gcp}/YYYYMMDD_{aws|gcp}-vX.Y.Z-{plan,apply}.txt`.
+
+**Review the plan logs** and call out unexpected destroys/replacements before running tests.
+
+`force_bundle=true` rebuilds the JAR (appropriate for release QA after Java changes).
+
+---
+
+## Step 4–5: Run connector tests
+
+```bash
+./tools/release/qa/run-example-tests.sh aws vX.Y.Z
+./tools/release/qa/run-example-tests.sh gcp vX.Y.Z
+```
+
+Outputs: `infra/examples-dev/{aws,gcp}/YYYYMMDD_{aws|gcp}-vX.Y.Z-tests.txt`
+
+Tests can take several minutes each (Slack async, bulk uploads, llm-portal bucket polling).
+
+---
+
+## Step 6: Summarize connector state
+
+```bash
+./tools/release/qa/summarize-connector-tests.sh aws infra/examples-dev/aws/YYYYMMDD_aws-vX.Y.Z-tests.txt vX.Y.Z \
+ > /tmp/aws-qa-summary.md
+
+./tools/release/qa/summarize-connector-tests.sh gcp infra/examples-dev/gcp/YYYYMMDD_gcp-vX.Y.Z-tests.txt vX.Y.Z \
+ > /tmp/gcp-qa-summary.md
+```
+
+Each command also writes sidecar files:
+
+- `*.summary.md` — markdown tables + category breakdown
+- `*.checklist` — machine-readable pass/fail per test-plan category
+
+Status meanings:
+
+| Status | Meaning |
+|--------|---------|
+| **pass** | Health + API/bulk/webhook verification succeeded |
+| **partial** | Proxy healthy but upstream API rejected the call |
+| **fail** | Missing secrets/config or connection setup error |
+
+Test-plan categories (from `tools/release/test_plan.md`):
+
+| Category | Example connectors |
+|----------|-------------------|
+| Microsoft API | `azure-ad`, `outlook-cal`, `msft-teams` |
+| Google Workspace API | `gcal`, `gdirectory`, `google-chat`, `gmail`, `gemini-in-workspace-apps` |
+| Token-based API | `asana`, `slack-analytics`, `zoom`, `jira-cloud`, `github`, … |
+| API with async | `slack-analytics` |
+| Webhook collector | `llm-portal` |
+| Bulk connector | `hris`, `metrics`, `workdata-generic` |
+
+A category is checked off when **at least one** connector in that category passes (partial counts for PR checkboxes).
+
+Present the user a combined summary before opening the PR. Note credential gaps vs real regressions.
+
+---
+
+## Step 7: Create release PR
+
+Must be on `rc-vX.Y.Z`:
+
+```bash
+git checkout rc-vX.Y.Z
+./tools/release/rc-to-main.sh vX.Y.Z
+```
+
+`rc-to-main.sh` is partially interactive (`npm audit fix` prompt). Answer `y` to continue unless dependency changes need a separate PR.
+
+Capture the PR URL/number from script output.
+
+---
+
+## Step 8: Post results on the release PR
+
+```bash
+PR_NUMBER=... # from rc-to-main.sh output
+
+./tools/release/qa/update-release-pr-results.sh \
+ "$PR_NUMBER" \
+ infra/examples-dev/aws/YYYYMMDD_aws-vX.Y.Z-tests.txt.checklist \
+ infra/examples-dev/gcp/YYYYMMDD_gcp-vX.Y.Z-tests.txt.checklist \
+ infra/examples-dev/aws/YYYYMMDD_aws-vX.Y.Z-tests.txt.summary.md \
+ infra/examples-dev/gcp/YYYYMMDD_gcp-vX.Y.Z-tests.txt.summary.md
+```
+
+This:
+
+1. Posts a PR comment with both AWS and GCP connector summaries
+2.
Checks off `- [x]` items under `### AWS` and `### GCP` in the PR body for categories that passed (including partial) + +--- + +## After merge + +Remind the user: + +```bash +./tools/release/publish.sh vX.Y.Z +``` + +--- + +## Troubleshooting + +| Issue | Action | +|-------|--------| +| `verify-release-refs.sh` fails | Run `./tools/release/prep.sh rc-vX.Y.Z vX.Y.Z` | +| Apply auth errors | Re-run `./az-auth`, `aws sso login`, `gcloud auth application-default login` | +| Connector fails with `missingConfigProperties` | Expected for unconfigured secrets; note in summary, not a proxy regression | +| `msft-teams` 401 while `azure-ad` works | Azure Graph permissions/consent issue | +| `rc-to-main.sh` branch error | Checkout `rc-vX.Y.Z` first | + +## Helper scripts + +| Script | Purpose | +|--------|---------| +| `tools/release/qa/verify-release-refs.sh` | Confirm rc → v ref migration | +| `tools/release/qa/apply-example.sh` | Plan + apply with logs | +| `tools/release/qa/run-example-tests.sh` | Run `test-all.sh`, capture output | +| `tools/release/qa/summarize-connector-tests.sh` | Parse test output → markdown | +| `tools/release/qa/update-release-pr-results.sh` | PR comment + checkbox update | diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml index 56c3113071..cff1c6cc10 100644 --- a/.github/workflows/codeql.yml +++ b/.github/workflows/codeql.yml @@ -47,22 +47,24 @@ jobs: matrix: include: - language: java-kotlin + category: java build-mode: none # This mode only analyzes Java. Set this to 'autobuild' or 'manual' to analyze Kotlin too. 
- language: javascript-typescript + category: javascript build-mode: none steps: - name: Checkout repository uses: actions/checkout@v4 - - name: Setup Java - if: matrix.language == 'java-kotlin' - uses: actions/setup-java@v4 - with: - java-version: '21' - distribution: zulu + - name: Setup Java + if: matrix.language == 'java-kotlin' + uses: actions/setup-java@v4 + with: + java-version: '21' + distribution: zulu # Initializes the CodeQL tools for scanning. - name: Initialize CodeQL - uses: github/codeql-action/init@v3 + uses: github/codeql-action/init@v4 with: languages: ${{ matrix.language }} build-mode: ${{ matrix.build-mode }} @@ -90,6 +92,6 @@ jobs: exit 1 - name: Perform CodeQL Analysis - uses: github/codeql-action/analyze@v3 + uses: github/codeql-action/analyze@v4 with: - category: "/language:${{matrix.language}}" + category: "/language:${{ matrix.category || matrix.language }}" diff --git a/.github/workflows/link-checker.yml b/.github/workflows/link-checker.yml new file mode 100644 index 0000000000..c20471143b --- /dev/null +++ b/.github/workflows/link-checker.yml @@ -0,0 +1,37 @@ +name: Link Checker + +on: + push: + branches: + - main + paths: + - 'docs/**/*.md' + - 'lychee.toml' + - '.github/workflows/link-checker.yml' + pull_request: + paths: + - 'docs/**/*.md' + - 'lychee.toml' + - '.github/workflows/link-checker.yml' + workflow_dispatch: + +jobs: + link-checker: + runs-on: ubuntu-latest + permissions: + contents: read + steps: + - uses: actions/checkout@v4 + + - name: Link Checker + uses: lycheeverse/lychee-action@v2 + with: + fail: true + args: >- + --config lychee.toml + --exclude-loopback + --verbose + --no-progress + docs/ + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/AGENTS.md b/AGENTS.md index 17e60a46c8..8a69b42b54 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -32,6 +32,10 @@ fi printf "${SUCCESS}Operation completed successfully.${NC}\n" ``` +## Release QA + +Before merging an `rc-vX.Y.Z` branch to `main`, follow 
[tools/release/release-qa.md](tools/release/release-qa.md). The orchestrator is `./tools/release/run-release-qa.sh vX.Y.Z`.
+
 ## Testing Conventions
 When modifying code in this repository, you should ensure that your changes pass our standardized tests.
diff --git a/CHANGELOG.md b/CHANGELOG.md
index f3d0d4ddab..679fdb8378 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,10 +5,13 @@ in each release's notes.
 Changes to be including in future/planned release notes will be added here.
-## [0.6.5]
+## [Unreleased]
+- `aws`/`gcp`: fix Terraform plan failure when `enable_remote_resources = true` but no artifacts bucket exists (e.g. with a prebuilt `deployment_bundle`). When remote resources are enabled, an artifacts bucket is now provisioned if one is not already created or provided via `artifacts_bucket_name` / `custom_artifacts_bucket_name`.
+
+## [0.6.5](https://github.com/Worklytics/psoxy/releases/tag/v0.6.5)
 - added `claude-enterprise-analytics` connector in **beta**; imports per-user daily activity, token usage, and cost data from the [Claude Enterprise Analytics API](https://support.claude.com/en/articles/13703965-claude-enterprise-analytics-api-reference-guide); see [docs/sources/anthropic/claude-enterprise-analytics/README.md](docs/sources/anthropic/claude-enterprise-analytics/README.md)
-## [0.6.4]
+## [0.6.4](https://github.com/Worklytics/psoxy/releases/tag/v0.6.4)
 - `aws`: consolidate IAM policies at the `aws-host` level to reduce per-connector policy/attachment churn (important for customers with low per-role IAM policy limits). PsoxyCaller now receives a single `CallerAccess` policy (lambda invoke, when applicable, plus read access to all provisioned output buckets: bulk sanitized, async, side-output, webhook, and lookup). Non-caller lookup-table accessor roles receive per-lookup `LookupBucketRead` policies scoped to their lookup bucket only. Bulk connector testing uses S3 bucket policies on each input/sanitized bucket granting the Terraform test principal upload/read/delete as needed, avoiding additional IAM policy attachments on the test role. **Upgrading customers should expect Terraform to destroy and recreate several IAM policies and attachments**; effective access should be unchanged, but we encourage reviewing the plan.
 ## [0.6.3](https://github.com/Worklytics/psoxy/releases/tag/v0.6.3)
diff --git a/docs/README.md b/docs/README.md
index 500f146182..a99745a5b5 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -80,7 +80,7 @@ Note: Some sources require specific licenses to transfer data via the APIs/endpo
 ### Google Workspace (formerly GSuite)
-For all of these, a Google Workspace Admin must authorize the Google OAuth client you provision (with [provided terraform modules](https://github.com/Worklytics/psoxy/tree/main/infra/examples)) to access your organization's data. This requires a Domain-wide Delegation grant with a set of scopes specific to each data source, via the Google Workspace Admin Console.
+For all of these, a Google Workspace Admin must authorize the Google OAuth client you provision (with [provided terraform modules](https://github.com/Worklytics/psoxy/tree/main/infra/examples-dev)) to access your organization's data. This requires a Domain-wide Delegation grant with a set of scopes specific to each data source, via the Google Workspace Admin Console.
 If you use our provided Terraform modules, specific instructions that you can pass to the Google Workspace Admin will be output for you.
@@ -92,11 +92,11 @@ If you use our provided Terraform modules, specific instructions that you can pa
 | Google Drive | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gdrive/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gdrive/gdrive.yaml) | `drive.metadata.readonly` |
 | GMail | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gmail/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gmail/gmail.yaml) | `gmail.metadata` |
 | Google Meet | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/meet/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/meet/meet.yaml) | `admin.reports.audit.readonly` |
-| Gemini Bulk (**deprecated**) | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gemini-usage/example.csv) | n/a; bulk export of Gemini logs |
+| Gemini Bulk (**deprecated**) | [docs](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gemini-usage-bulk) | n/a; bulk export of Gemini logs |
 | Gemini in Google Workspace (**beta**) | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gemini-in-workspace-apps/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/google-workspace/gemini-in-workspace-apps/gemini-in-workspace-apps.yaml) | `admin.reports.audit.readonly` |
-NOTE: the above scopes are copied from [infra/modules/worklytics-connector-specs](https://github.com/Worklytics/psoxy/tree/main/infra/modules/worklytics-connector-specs). Please refer to that module for a definitive list.
+NOTE: the above scopes are copied from [`infra/modules/worklytics-connector-specs/google-workspace.tf`](https://github.com/Worklytics/psoxy/blob/main/infra/modules/worklytics-connector-specs/google-workspace.tf). Per-connector scope lists are on each [Google Workspace connector](sources/google-workspace/README.md) page. Please refer to that module for a definitive list.
 NOTE: 'Google Directory' connection is required prerequisite for all other Google Workspace connectors.
@@ -108,7 +108,7 @@ See details: [sources/google-workspace/README.md](sources/google-workspace/READM
 ### Microsoft 365
-For all of these, a Microsoft 365 Admin (at minimum, a [Privileged Role Administrator](https://learn.microsoft.com/en-us/entra/identity/role-based-access-control/permissions-reference#privileged-role-administrator)) must authorize the Microsoft Entra ID Application you provision (with [provided terraform modules](infra/examples)) to access your Microsoft 365 tenant's data with the scopes listed below. This is done via the [Microsoft Entra admin center](https://entra.microsoft.com). If you use our provided Terraform modules, specific instructions that you can pass to the Microsoft 365 Admin will be output for you.
+For all of these, a Microsoft 365 Admin (at minimum, a [Privileged Role Administrator](https://learn.microsoft.com/en-us/entra/identity/role-based-access-control/permissions-reference#privileged-role-administrator)) must authorize the Microsoft Entra ID Application you provision (with [provided terraform modules](https://github.com/Worklytics/psoxy/tree/main/infra/examples-dev)) to access your Microsoft 365 tenant's data with the scopes listed below. This is done via the [Microsoft Entra admin center](https://entra.microsoft.com). If you use our provided Terraform modules, specific instructions that you can pass to the Microsoft 365 Admin will be output for you.
| Source                 | Examples    | Application Scopes | |--------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -242,7 +242,7 @@ You will need all the following in your deployment environment (eg, your laptop) NOTE: we will support Java versions for duration of official support windows, in particular the LTS versions. Minor versions may work but are not routinely tested. As of March 2026, officially tested versions include Java 21 (LTS), 25, and 26. -NOTE: Using `terraform` is not strictly necessary, but it is the only supported method. You may provision your infrastructure via your host's CLI, web console, or another infrastructure provisioning tool, but we don't offer documentation or support in doing so. Adapting one of our [terraform examples](https://github.com/Worklytics/psoxy/tree/main/infra/examples) or writing your own config that re-uses our [modules](https://github.com/Worklytics/psoxy/tree/main/infra/modules) will simplify things greatly. 
+NOTE: Using `terraform` is not strictly necessary, but it is the only supported method. You may provision your infrastructure via your host's CLI, web console, or another infrastructure provisioning tool, but we don't offer documentation or support in doing so. Adapting one of our [terraform examples](https://github.com/Worklytics/psoxy/tree/main/infra/examples-dev) or writing your own config that re-uses our [modules](https://github.com/Worklytics/psoxy/tree/main/infra/modules) will simplify things greatly.
 NOTE: from v0.6.x, we require Terraform 1.7.x as minimum. We strive to maintain compatibility with both OpenTofu and Terraform.
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index 6a1f2d5b14..b72535ba5f 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -45,7 +45,7 @@
 * [Client IP Allowlisting](configuration/ip-allowlisting.md)
 * [TLS Version](configuration/tls.md)
 * [New Relic Monitoring](configuration/new-relic-monitoring.md)
- * [Remote Resources **BETA**](configuration/remote-resources.md)
+ * [Remote Resources](configuration/remote-resources.md)
 * [Development](development/README.md)
 * [Approaches for Example / Module design](development/terraform-architecture.md)
 * [Create a private fork](development/private-fork.md)
@@ -59,6 +59,7 @@
 * [Claude](sources/anthropic/claude/README.md)
 * [Claude Enterprise Analytics](sources/anthropic/claude-enterprise-analytics/README.md)
 * [Claude Code](sources/anthropic/claude-code/README.md)
+ * [Claude Code Bulk](sources/anthropic/claude-code-bulk/README.md)
 * [Asana](sources/asana/README.md)
 * [Atlassian](sources/atlassian/README.md)
 * [Confluence Cloud](sources/atlassian/confluence/README.md)
@@ -93,7 +94,6 @@
 * [HRIS](sources/hris/README.md)
 * [Metrics](sources/metrics/README.md)
 * [Microsoft 365](sources/microsoft-365/README.md)
- * [API Call Examples](sources/microsoft-365/example-api-calls.md)
 * [Entra ID](sources/microsoft-365/entra-id/README.md)
 * [Microsoft Copilot](sources/microsoft-365/msft-copilot/README.md)
 * [Microsoft Teams](sources/microsoft-365/msft-teams/README.md)
@@ -102,6 +102,7 @@
 * [Miro](sources/miro/README.md)
 * [Miro AI Bulk](sources/miro/miro-ai-bulk/README.md)
 * [Salesforce](sources/salesforce/README.md)
+ * [Sales for Copilot Bulk](sources/salesforce/sales-for-copilot/README.md)
 * [Slack](sources/slack/README.md)
 * [Slack AI Analytics Bulk](sources/slack/slack-ai-bulk/README.md)
 * [Slack Analytics](sources/slack/slack-analytics/README.md)
diff --git a/docs/authentication-authorization.md b/docs/authentication-authorization.md
index 892b35938d..de23357d4b 100644
--- a/docs/authentication-authorization.md
+++ b/docs/authentication-authorization.md
@@ -14,6 +14,8 @@ Worklytics is **authorized** to access your proxy instance via an Identity and A
 Worklytics **authenticates** in all cases via Workload Identity Federation; as your Worklytics tenant is running natively in the cloud, it can leverage the cloud provider's native IAM service to establish identity which can be asserted to other services in the cloud.
+If you restrict proxy access by client IP, Worklytics can ensure fixed egress IP addresses for outbound requests from your tenant as a paid add-on. Contact [sales@worklytics.co](mailto:sales@worklytics.co) or see [Client IP Allowlisting](configuration/ip-allowlisting.md).
+
 ## Proxy to Data Source API (2)
 Although exact details vary by data source, most utilize some form of [OAuth 2.0](https://oauth.net/2/) for authorization and authentication.
@@ -22,7 +24,7 @@ A data source admin (eg, a Google Workspace admin) must **authorize** the proxy
 See [https://docs.worklytics.co/psoxy#supported-data-sources](https://docs.worklytics.co/psoxy#supported-data-sources)
-The proxy **authenticates** itself for calls to the data source using one of the supported OAuth 2.0 mechanisms, see [https://oauth.net/2/client-authentication/]. Most commonly, these are [Client Credentials](https://oauth.net/2/grant-types/client-credentials/) or [Workload Identity Federation](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation).
+The proxy **authenticates** itself for calls to the data source using one of the supported OAuth 2.0 mechanisms; see [https://oauth.net/2/client-authentication/](https://oauth.net/2/client-authentication/). Most commonly, these are [Client Credentials](https://oauth.net/2/grant-types/client-credentials/) or [Workload Identity Federation](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation).
 In particular, a quick overview for common sources:
 - Google Workspace sources authenticate via Client Credentials (a GCP Service Account key)
diff --git a/docs/aws/authentication-authorization.md b/docs/aws/authentication-authorization.md
index 512adee1b1..a886954d50 100644
--- a/docs/aws/authentication-authorization.md
+++ b/docs/aws/authentication-authorization.md
@@ -49,4 +49,6 @@ Then you use this AWS IAM role as the principal in AWS IAM policies you define t
 When `allowed_data_access_ip_blocks` or `allowed_webhook_ip_blocks` is set in Terraform, the AWS core module adds `aws:SourceIp` conditions to the **assume-role** policies for the deployment's caller roles. Principals cannot assume those roles unless the request originates from an allowed IP or CIDR. This is **infrastructure-level** enforcement, in addition to per-request checks inside the proxy.
+Worklytics can ensure fixed egress IP addresses for outbound requests from your tenant as a paid add-on. Contact [sales@worklytics.co](mailto:sales@worklytics.co) for details.
+
 See [Client IP Allowlisting](../configuration/ip-allowlisting.md).
diff --git a/docs/aws/cloud-shell.md b/docs/aws/cloud-shell.md
index b1a80ee931..1a455e622b 100644
--- a/docs/aws/cloud-shell.md
+++ b/docs/aws/cloud-shell.md
@@ -44,7 +44,7 @@ Then `source ~/.bashrc`, to execute the above.
 4. if using Microsoft 365 data sources, install Azure CLI and authenticate.
-https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
+[https://docs.microsoft.com/en-us/cli/azure/install-azure-cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)
 You should now be ready for the general instructions in the [README.md](../../README.md).
diff --git a/docs/aws/development.md b/docs/aws/development.md
index 5f25604a62..fc1ff10959 100644
--- a/docs/aws/development.md
+++ b/docs/aws/development.md
@@ -26,7 +26,7 @@ mvn clean package
 Locally, you can test function's behavior from invocation on a JSON payload (but not how the API gateway will map HTTP requests to that JSON payload):
-https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html
+[https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html)
 ## Deploy to AWS
diff --git a/docs/aws/guides/lambdas-on-vpc.md b/docs/aws/guides/lambdas-on-vpc.md
index cd3a914ecc..ca7222c831 100644
--- a/docs/aws/guides/lambdas-on-vpc.md
+++ b/docs/aws/guides/lambdas-on-vpc.md
@@ -89,5 +89,5 @@ So:
 ## References
-- https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html
-- https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
+- [https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html](https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html)
+- [https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html)
diff --git a/docs/aws/protips.md b/docs/aws/protips.md
index bf280c0307..6ca4c5827a 100644
--- a/docs/aws/protips.md
+++ b/docs/aws/protips.md
@@ -20,7 +20,7 @@ default_tags = {
 If you're not using our AWS example, you can add the following to your configuration, then you will need to modify the `aws` provider block in your configuration to add a `default_tags`. Example shown below:
-See: [https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags]
+See: [https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags)
 ```hcl
 provider "aws" {
@@ -176,7 +176,7 @@ resource "aws_s3_bucket_lifecycle_configuration" "data_buckets" {
 The terraform modules we provide provision execution roles for each lambda function, and attach by default attach the appropriate AWS Managed Policy to each.
-Specifically, this is [`AWSLambdaBasicExecutionRole`](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html), unless you're using a VPC - in which case it is `AWSLambdaVPCAccessExecutionRole`(https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaVPCAccessExecutionRole.html).
+Specifically, this is [`AWSLambdaBasicExecutionRole`](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html), unless you're using a VPC - in which case it is [`AWSLambdaVPCAccessExecutionRole`](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaVPCAccessExecutionRole.html).
 For organizations that don't allow use of AWS Managed Policies, you can use the `aws_lambda_execution_role_policy_arn` variable to pass in an alternative which will be used INSTEAD of the AWS Managed Policy.
diff --git a/docs/aws/sbom.json b/docs/aws/sbom.json index e02bbf5f5e..cf403f83d7 100644 --- a/docs/aws/sbom.json +++ b/docs/aws/sbom.json @@ -1,10 +1,10 @@ { "bomFormat" : "CycloneDX", "specVersion" : "1.5", - "serialNumber" : "urn:uuid:04d17ffb-7c8c-39c0-ac9f-110da0a6fbde", + "serialNumber" : "urn:uuid:c0b002ce-cbd6-3584-9a9c-2c36a495f051", "version" : 1, "metadata" : { - "timestamp" : "2026-06-18T18:28:17Z", + "timestamp" : "2026-06-30T21:43:40Z", "lifecycles" : [ { "phase" : "build" @@ -59,9 +59,9 @@ "component" : { "group" : "co.worklytics.psoxy", "name" : "psoxy-aws", - "version" : "0.6.6", + "version" : "0.6.7", "licenses" : [ ], - "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-aws@0.6.6?type=jar", + "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-aws@0.6.7?type=jar", "externalReferences" : [ { "type" : "distribution-intake", @@ -69,7 +69,7 @@ } ], "type" : "library", - "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-aws@0.6.6?type=jar" + "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-aws@0.6.7?type=jar" }, "properties" : [ { @@ -86,40 +86,40 @@ { "group" : "co.worklytics.psoxy", "name" : "psoxy-core", - "version" : "0.6.6", + "version" : "0.6.7", "scope" : "required", "hashes" : [ { "alg" : "MD5", - "content" : "6354851ecef5541ffee800a87e945875" + "content" : "bf1ec602a1fa1ca55115146865465f3f" }, { "alg" : "SHA-1", - "content" : "bcf790c49ed33bdb1a36d4cfef49b13c61dce7e4" + "content" : "e1225df205fc44ef9b9d092f10dbb25fd011a072" }, { "alg" : "SHA-256", - "content" : "b5cf492919eaa982d95c6e4848a3e1e7649f28f2c45869f98ac0acc93fe3814b" + "content" : "ba69cd2cc003984447e7a121bc75b70112775fdbec78fc6815632a0441fd2bc7" }, { "alg" : "SHA-512", - "content" : "7d16243f1f920475c1d75b400002d868fddfbdebf9b080ea8e3f6c3c36253b279a8bea7f3ffefb188222ed626d34cbf99575970fe98ec293c720c96a4e924d89" + "content" : "65fe1fa40e36ef5a34f2889e08ce418182b335d694cc0c8a714de436aca6f64824ea7e3a6c628ca55a4be1e0c3bbef4f0f2104068cdb027f014d38f3619a4f85" }, { "alg" : "SHA-384", - "content" : 
"78226094335202889ef2faf96c9444bb7d2c0ce417eb2500d2c730548bfcfa3a6099ca0bb9fd23f6b645a08905e8ee43" + "content" : "c1dc0fe532d23a9d2cc96ff98a5ab6e2d1738593a196fe2cdcdc61986b72ee38eb5ea8343b8650d9c5ed10977cfff86f" }, { "alg" : "SHA3-384", - "content" : "480cc356ba982e35240c2c0209049fe5d3bf9e82982b7663412ef454ab1063265db8a875cfee6832b75c683ec644af00" + "content" : "78e4d62a1bd03810ceac35b6c847e2ff734f41c2b0ff3b4d0f67c78484f2f3750efffe4a887a9ee24a70ba71375e54af" }, { "alg" : "SHA3-256", - "content" : "b0f31e47b7b6400e00f9fd4b827ec48ff9d26e48479a21ba90f91cd341a46996" + "content" : "69fc9e98bceef5eec93ee58e213270fe48eadfa7439fb78fdd6cb9c0afcb1936" }, { "alg" : "SHA3-512", - "content" : "21e9b664483a864d0015e0b09035686c003534a168320b1b0eceff20c0c77dfd0ac1479707fbe960bad8ef71d0bea2695ff5e76a4546660c4ce39328c9ba6875" + "content" : "b776058d347820421cbb4cd11bb606c7e9ad4428016b559650aa957d34acaad2018a835c3aff46c480a420f8fbd9fc9316b7114f864995a6bf91fab98d493598" } ], "licenses" : [ @@ -130,47 +130,47 @@ } } ], - "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar", + "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar", "type" : "library", - "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar" + "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar" }, { "group" : "com.avaulta.gateway", "name" : "gateway-core", - "version" : "0.6.6", + "version" : "0.6.7", "scope" : "required", "hashes" : [ { "alg" : "MD5", - "content" : "5761afbecd3f4a0347e52c005c2a1c53" + "content" : "e5dbbaeef87a3b12bba0c230e992d2b6" }, { "alg" : "SHA-1", - "content" : "1430f70ebcdb96933a1b9e3d024275b2a5da2be5" + "content" : "7941578ecd6f373dd6a659e430b0135c422b497f" }, { "alg" : "SHA-256", - "content" : "bb1802066c21112d66b9e8339588da97aac032a1e4a9ef69303e8d08499f94f7" + "content" : "a13f5e67238769b73cd0c9ab9f7676b2eed99a7bd1678176d0628b42f603f32d" }, { "alg" : "SHA-512", - "content" : 
"3b977bbdfe053c841feb1e4170185bbd43618fdbe84ebe88bd8415e0b017f4dd0c320540eb62cc00e6b2a28788d38f5935c74e2e01663ce8c2a0ed12fcac66c6" + "content" : "0bdaec875f89fc0ecb4cdd19ba0e78d693128d34afcb8612a43b1329e42ef2d04dfa40067adcb40e40cf94c08abc64a922cad0269ca93bdd94b36b20e6409566" }, { "alg" : "SHA-384", - "content" : "6f8e74717126d333c18d38d698aa3f7043be87fa539428abeedc90851782848a65294b05f883111a591b855022465e3b" + "content" : "225a8f595ade81d68f1f9f53e96dd40da0014709821c07643a0f34a1a99cc95458ef337df289422cf1bb808acac74af2" }, { "alg" : "SHA3-384", - "content" : "2ea2a14d387b89a6bd6bb1e59a2943695562ae326137a0fd9642299c0f008dc3a31ad46954ef84ba07a69afb656875a6" + "content" : "e2b4b0e218920112f497ef719684b73718e096a0d7b231e1cd9f80f8e429caa148289dfff25bf7aec965e391d6bc69d8" }, { "alg" : "SHA3-256", - "content" : "0ed7c4b5cb3adffc86c07b4b6c565b025e5c0d87e2f36b4a957a2f4a02a8ba26" + "content" : "931c6faa7ce198ab51a6f07a1febe3c26d7c830afc8e0055632d6dbad73736d1" }, { "alg" : "SHA3-512", - "content" : "a3ec68bdd1937d69281f6bdb382dd46371cc49ddcc24ef900cdd3a556eb49e4a9a5c696830708b4be3b75cc69bc0d29ec5f1650a2e5985e4db14fd6dc8fd1f60" + "content" : "17e02c6a29108ac7d22171057a87f7d88bac5517d2910e7d4713fe61278ca05a9ea3883a5cc7ac01da0802a5938739febff5e89b196ceba6fe3135e15f4fce4c" } ], "licenses" : [ @@ -181,9 +181,9 @@ } } ], - "purl" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar", + "purl" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar", "type" : "library", - "bom-ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar" + "bom-ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar" }, { "publisher" : "The Apache Software Foundation", @@ -10157,9 +10157,9 @@ ], "dependencies" : [ { - "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-aws@0.6.6?type=jar", + "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-aws@0.6.7?type=jar", "dependsOn" : [ - "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar", + 
"pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar", "pkg:maven/org.projectlombok/lombok@1.18.42?type=jar", "pkg:maven/com.google.dagger/dagger-compiler@2.40.5?type=jar", "pkg:maven/com.google.dagger/dagger@2.40.5?type=jar", @@ -10179,9 +10179,9 @@ ] }, { - "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar", + "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar", "dependsOn" : [ - "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar", + "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar", "pkg:maven/org.apache.commons/commons-lang3@3.19.0?type=jar", "pkg:maven/org.apache.commons/commons-csv@1.14.1?type=jar", "pkg:maven/commons-io/commons-io@2.18.0?type=jar", @@ -10206,7 +10206,7 @@ ] }, { - "ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar", + "ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar", "dependsOn" : [ "pkg:maven/org.apache.commons/commons-lang3@3.19.0?type=jar", "pkg:maven/org.apache.commons/commons-csv@1.14.1?type=jar", diff --git a/docs/aws/troubleshooting.md b/docs/aws/troubleshooting.md index c702ba554f..82cac9d230 100644 --- a/docs/aws/troubleshooting.md +++ b/docs/aws/troubleshooting.md @@ -52,7 +52,7 @@ AWS_PROFILE="production" terraform plan ``` References: -https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7 +[https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7](https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7) ## Your AWS User has MFA diff --git a/docs/configuration/api-data-sanitization.md b/docs/configuration/api-data-sanitization.md index 348d3b69a5..d9962cc514 100644 --- a/docs/configuration/api-data-sanitization.md +++ b/docs/configuration/api-data-sanitization.md @@ -219,7 +219,7 @@ A "response schema" is a "JSON Schema Filter" structure, specifying how response Our "JSON Schema Filter" implementation attempts to align to the [JSON 
Schema](https://json-schema.org/specification-links.html) specification, with some variation as it is intended for _filtering_ rather than _validation_. But generally speaking, you should be able to copy the JSON Schema for an API endpoint from its [OpenAPI specification](https://swagger.io/specification/) as a starting point for the `responseSchema` value in your rule set. Similarly, there are tools that can generate JSON Schema from example JSON content, as well as from data models in various languages, that may be useful. -See: [https://json-schema.org/implementations.html#schema-generators](https://json-schema.org/implementations.html#schema-generators) +See: [https://json-schema.org/tools](https://json-schema.org/tools) If a `responseSchema` attribute is specified for an `endpoint`, the response content will be _filtered_ (rather than validated) against that schema. Eg, fields NOT specified in the schema, or not of expected type, will be removed from the response. diff --git a/docs/configuration/bulk-file-sanitization.md b/docs/configuration/bulk-file-sanitization.md index b281dc82d9..028048def9 100644 --- a/docs/configuration/bulk-file-sanitization.md +++ b/docs/configuration/bulk-file-sanitization.md @@ -283,4 +283,4 @@ If you encounter issues processing your files, check the logs of the Psoxy insta 1. Use compression in the file (see [Compression](#compression)); if already compressed, then: 2. Split the file into smaller files and process them separately 3. (AWS only) Update the proxy version to v0.4.55 or later -4. (AWS only) If in v0.4.55 or later, process the files one by one or increase the ephemeral storage allocated to the Lambda function (see https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/) +4. 
(AWS only) If in v0.4.55 or later, process the files one by one or increase the ephemeral storage allocated to the Lambda function (see [https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/](https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/)). diff --git a/docs/configuration/remote-resources.md b/docs/configuration/remote-resources.md index bd8c4c8ded..b53514a579 100644 --- a/docs/configuration/remote-resources.md +++ b/docs/configuration/remote-resources.md @@ -1,6 +1,7 @@ -# Remote Resources (beta) +# Remote Resources -> **Status: beta** — This feature is functional but may evolve. Feedback welcome. +> [!NOTE] +> This feature is in beta. It is functional but may evolve; feedback welcome. Psoxy supports loading resources (sanitization rules, NLP models, etc.) from a remote cloud storage bucket (S3 on AWS, GCS on GCP). This enables configuration that is too large for environment @@ -28,13 +29,19 @@ mounted locally. ## Terraform Configuration -By default, the host modules in this repository (`aws-host` and `gcp-host`) will configure the -`REMOTE_RESOURCE_BUCKET` for you if you set the `enable_remote_resources` variable to `true`. This -automatically wires the **artifacts bucket** (used for deployment bundles) as the remote resource bucket. +Remote resources are **opt-in**. Set `enable_remote_resources = true` on the host module +(`aws-host` or `gcp-host`) when you want psoxy to load rules, NLP models, or other assets from +the artifacts bucket at runtime. The host module does not infer this from your connector list. + +When enabled, the host module uses the artifacts bucket — either one you provide +(`artifacts_bucket_name` / `custom_artifacts_bucket_name`), one already provisioned for a local +deployment bundle, or a newly provisioned bucket when using a prebuilt `s3://` / `gs://` +`deployment_bundle`. 
> [!IMPORTANT] -> - If you configure an existing bucket (e.g., by providing `artifacts_bucket_name`), the bucket must already exist. -> - The Terraform runner (the credentials running the `terraform` command) must have sufficient IAM permissions on that bucket to apply permissions (since it will grant read access to the proxy's service account or Lambda execution role). +> If you supply an existing bucket (`artifacts_bucket_name` / `custom_artifacts_bucket_name`), it must already exist. +> +> The Terraform runner (the credentials running the `terraform` command) must have sufficient IAM permissions on that bucket to apply permissions, since it will grant read access to the proxy's service account or Lambda execution role. ### AWS (`aws-host`) @@ -44,7 +51,6 @@ module "psoxy" { # ... existing configuration ... - # Enable remote resource loading from the artifacts S3 bucket enable_remote_resources = true } ``` @@ -61,7 +67,6 @@ module "psoxy" { # ... existing configuration ... - # Enable remote resource loading from the artifacts GCS bucket enable_remote_resources = true } ``` diff --git a/docs/development/compression.md b/docs/development/compression.md index 1bf3f00f8e..841c9034f1 100644 --- a/docs/development/compression.md +++ b/docs/development/compression.md @@ -38,7 +38,7 @@ request for compressed response, and then compress the response. API Gateway is no longer used by our default terraform examples. But compression can be enabled at the gateway level (rather than relying on function url implementation, or in addition to). 
-https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html +[https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html) ### GCP @@ -47,7 +47,7 @@ GCP Cloud Functions will handle compression themselves IF the request meets vari There is no explicit, Cloud Function-specific documentation about this, but it seems that the behavior for App Engine applies: -https://cloud.google.com/appengine/docs/legacy/standard/go111/how-requests-are-handled#:~:text=For%20responses%20that%20are%20returned,HTML%2C%20CSS%2C%20or%20JavaScript. +[https://cloud.google.com/appengine/docs/legacy/standard/go111/how-requests-are-handled#:~:text=For%20responses%20that%20are%20returned,HTML%2C%20CSS%2C%20or%20JavaScript.](https://cloud.google.com/appengine/docs/legacy/standard/go111/how-requests-are-handled#:~:text=For%20responses%20that%20are%20returned,HTML%2C%20CSS%2C%20or%20JavaScript.) ## Source-to-Proxy Response diff --git a/docs/development/maven-artifacts.md b/docs/development/maven-artifacts.md index d4c8a67d63..b46e7a3b64 100644 --- a/docs/development/maven-artifacts.md +++ b/docs/development/maven-artifacts.md @@ -13,7 +13,7 @@ To consume packages from GitHub Packages, you need: 1. A GitHub account 2. A GitHub Personal Access Token (PAT) with `read:packages` permission - - Create one at: https://github.com/settings/tokens + - Create one at: [https://github.com/settings/tokens](https://github.com/settings/tokens) - Select the `read:packages` scope ### Maven Configuration diff --git a/docs/development/releases.md b/docs/development/releases.md index 4d28acb060..12317b2bef 100644 --- a/docs/development/releases.md +++ b/docs/development/releases.md @@ -32,8 +32,11 @@ git commit -a -m "update deps in psoxy-test" TODO: review versions of terraform, java, node uses in github actions. 
Ensure we're explicitly using the latest of each, and that we're ALSO testing explicitly for latest-1 version, even if it's not officially supported still. -QA aws, gcp dev examples by running `terraform apply` for each, and testing various connectors. +QA aws and gcp dev examples before merging. See [release QA runbook](../../tools/release/release-qa.md) or run: +```shell +./tools/release/run-release-qa.sh vX.Y.Z +``` Create PR to merge `rc-` to `main`. diff --git a/docs/development/webhook-collectors.md b/docs/development/webhook-collectors.md index 76f998cadd..d1c3946c47 100644 --- a/docs/development/webhook-collectors.md +++ b/docs/development/webhook-collectors.md @@ -43,7 +43,7 @@ Webhooks will always be written as NDJSON (newline-delimited JSON) to the output To do this efficiently, we'd split into 2 steps: 1) webhook collector that receives webhook payload, accepts + sanitizes it, and then sends it to SQS. Then separately a trigger that 2) batches messages from SQS and writes them to the output bucket as NDJSON files. 
-https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-configure.html +[https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-configure.html](https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-configure.html) To handle both in same lambda, we need `WebhookCollectionModeHandler` to handle streams, and parse whether those are direct invocations of the webhook collector or SQS message batches diff --git a/docs/gcp/authentication-authorization.md b/docs/gcp/authentication-authorization.md index 0d3967faa9..adf0f23bf0 100644 --- a/docs/gcp/authentication-authorization.md +++ b/docs/gcp/authentication-authorization.md @@ -20,6 +20,8 @@ You can obtain the identity of your Worklytics tenant's GCP service account from When `allowed_data_access_ip_blocks` or `allowed_webhook_ip_blocks` is set in Terraform, the proxy enforces allowlists **inside the Cloud Function** via `ALLOWED_DATA_ACCESS_IP_BLOCKS` and `ALLOWED_WEBHOOK_IP_BLOCKS` environment variables. The shipped modules do **not** add source-IP IAM conditions on Cloud Run invoker bindings (GCP does not support that pattern on `roles/run.invoker`). +Worklytics can ensure fixed egress IP addresses for outbound requests from your tenant as a paid add-on. Contact [sales@worklytics.co](mailto:sales@worklytics.co) for details. + For network ingress filtering in front of Cloud Run (for example Cloud Armor on a load balancer), see [GCP Private Service Connect and connectivity options](../development/gcp-private-service-connect.md#enhancing-public-internet-options-with-ip-allowlisting). That is separate from the Terraform allowlist variables. See [Client IP Allowlisting](../configuration/ip-allowlisting.md). 
diff --git a/docs/gcp/guides/lookup-tables.md b/docs/gcp/guides/lookup-tables.md
index 4427c8c90b..dd0d7b7265 100644
--- a/docs/gcp/guides/lookup-tables.md
+++ b/docs/gcp/guides/lookup-tables.md
@@ -2,7 +2,7 @@
 
 If you use Psoxy to send pseudonymized data to Worklytics and later wish to re-identify the data that you export from Worklytics to your premises, you'll need a lookup table in your data warehouse to JOIN with that data.
 
-Our `gcp-host` Terraform module, as used in our [Psoxy GCS Example](https://github.com/Worklytics/psoxy-example-gcs/tree/main), provides a variable `lookup_tables` to control generation of these lookup tables.
+Our `gcp-host` Terraform module, as used in our [Psoxy GCP Example](https://github.com/Worklytics/psoxy-example-gcp/tree/main), provides a variable `lookup_tables` to control generation of these lookup tables.
 
 Populating this variable will generate another version of your HRIS data (aside from the one exposed to Worklytics) which you can then import back to your data warehouse.
diff --git a/docs/gcp/sbom.json b/docs/gcp/sbom.json index 51e4112cfe..92a27f9e55 100644 --- a/docs/gcp/sbom.json +++ b/docs/gcp/sbom.json @@ -1,10 +1,10 @@ { "bomFormat" : "CycloneDX", "specVersion" : "1.5", - "serialNumber" : "urn:uuid:c05f478f-e25e-3ca5-b754-23273b8f4267", + "serialNumber" : "urn:uuid:9059d82b-181e-35e7-b147-52818670473e", "version" : 1, "metadata" : { - "timestamp" : "2026-06-18T18:28:44Z", + "timestamp" : "2026-06-30T21:44:17Z", "lifecycles" : [ { "phase" : "build" @@ -59,9 +59,9 @@ "component" : { "group" : "co.worklytics.psoxy", "name" : "psoxy-gcp", - "version" : "0.6.6", + "version" : "0.6.7", "licenses" : [ ], - "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-gcp@0.6.6?type=jar", + "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-gcp@0.6.7?type=jar", "externalReferences" : [ { "type" : "distribution-intake", @@ -69,7 +69,7 @@ } ], "type" : "library", - "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-gcp@0.6.6?type=jar" + "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-gcp@0.6.7?type=jar" }, "properties" : [ { @@ -86,40 +86,40 @@ { "group" : "co.worklytics.psoxy", "name" : "psoxy-core", - "version" : "0.6.6", + "version" : "0.6.7", "scope" : "required", "hashes" : [ { "alg" : "MD5", - "content" : "87eddc8e5f702c1d998d1acdde344e74" + "content" : "a17ba39408f20db81bb93cf7c7ec547d" }, { "alg" : "SHA-1", - "content" : "3f2032603ef48fb9120d307918bd46620530ecc2" + "content" : "d97bf36b4ddbafcf06e066b25be0a82845b3921a" }, { "alg" : "SHA-256", - "content" : "c7be10b9ead4a542f4c2c326d74efb413f941c527511864ea41e868b1a9c5880" + "content" : "405f9b46a2c55cf7788a5bce70d95b25df235e5b7a9afef73dae42160739f199" }, { "alg" : "SHA-512", - "content" : "43fcdc7676cc91ae1ed38b4b88b4d82612bfa60cf1583bdbb82a1a1638fb1354a78c131386a8f240a022f8a7a9f14e34b686bea7e1462a6329c8dce741992815" + "content" : "fd176a40804d00be57b1640b065963d12a1383da7bec6c3607d37e1f99fc9ae1258b7e0d77a2757ab4ef0f41035bcee88fa788057e6e00d36b532d202432c5b4" }, { "alg" : "SHA-384", - "content" : 
"4fe58a6da8ba4c00d4dd08ef302e890cedd03975a7c431175968c6812cc6e84feb32c1f88e6dc546af10a1c70213a9e6" + "content" : "401b60509be776510a146a7806ccace87a9ba909b77a5b2f3414d7201bc06a18f983fc12410fb548f2e1d0ace588ce0f" }, { "alg" : "SHA3-384", - "content" : "50a2178426c532494f967aba1441d7412a6e0faa5fe539fd5c577ffc114ff589373855b4db3d69f28d18a704241d8eca" + "content" : "91e61ed70cbfdf78a59b6c2e5efdbba7e9e232d3030203ca6af7ccfe1f43c0ec96590ddd6967dce9f1015fad4c63e8ff" }, { "alg" : "SHA3-256", - "content" : "5a6ef60a278035c6f92212058fe8b9ee6c87b4adc27ef9739e3b5b35c399150e" + "content" : "2cc9772be46ecb2c3d4b337baa23bde25e33ab1a6d41f98957b5c9a75bf6c6bb" }, { "alg" : "SHA3-512", - "content" : "a9b6b25f67523199f6f0bed8931ed3b3c6ee461762d5a371ef8684ac6798a19b3877d6fe3537636cac80dfdfe0b1500fc74504d1f882685fabccb6252c197128" + "content" : "dcea1dbdf1e40cfc39788b2a7e90a3fb36e4efafbee919d383808dc535ad23022e72c95c45120ae02e9a122ca6660620573bdf22c0e5761005eb1911099eeecf" } ], "licenses" : [ @@ -130,47 +130,47 @@ } } ], - "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar", + "purl" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar", "type" : "library", - "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar" + "bom-ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar" }, { "group" : "com.avaulta.gateway", "name" : "gateway-core", - "version" : "0.6.6", + "version" : "0.6.7", "scope" : "required", "hashes" : [ { "alg" : "MD5", - "content" : "3695b96697effac61ec4284edd27087e" + "content" : "fbeb7519e3acdc6f46fd05ce2c6dae67" }, { "alg" : "SHA-1", - "content" : "868705aed1f6306c1cb47a5845f0f744feaf053d" + "content" : "36e298d96f45781d912e7e25f5511e68956906f1" }, { "alg" : "SHA-256", - "content" : "c3ca35438db7144a36ead65d6c05abf45a611bdc7a39fedc6bed4d904a3c8eba" + "content" : "4c4a60c4ec67364b0a8425798410e4148b6aa2c530a644c9cee0427f480ebd6a" }, { "alg" : "SHA-512", - "content" : 
"d0fc39ac5c829b9850f341f762061e56e351be3336e543542b4f00cdb2785302a37abd3d80c289d33f0c8f8ba94ea4e9d703b1d1401f11302fe95f2a72e1b827" + "content" : "a7f00f52a2e75358e8be60d001e12e56e4337f05702bd20cac6ba0cf7943e34d1a87d9a2c29b7b6568ba8e39d45666f20ff4eea38a2128456871157e378520d6" }, { "alg" : "SHA-384", - "content" : "2d5f32b9f2c8327ece89422d831d2a61adeeba825b98d8c5822c5585fcf61bd0df041f0e8298920ae50382f17a159d6f" + "content" : "5415de37a97e4323616484316176fe1d564308a37255e5dafabb2bae5c76909b8f078ae3bd1c271b21a1c56c79916c4c" }, { "alg" : "SHA3-384", - "content" : "e32df957913151cc9afe2f9c58f0fdd16150a78f9475060bf1bb7ee2008cabb59f41069d727c5fca24b0071d524b1498" + "content" : "661394aad82be83222495690fd5c41a7659024487f2d6c86312241bee6607bb844f0302f83d7bd96e56513b0fd937f48" }, { "alg" : "SHA3-256", - "content" : "537955bb8865270fc345fe5f9e9c90c440fa04c5f9934c88cc907b8eec410312" + "content" : "d7f99bd299f9d649f827e1cad170a56ee20c1dff64f741f598b8a5edbd022ef2" }, { "alg" : "SHA3-512", - "content" : "2a0a3b1d740ef1d0c381ce58e0b0325f7bdc0d08250075cc6a6a833389c84b5333d97af249cd763f9567c49e22b34bb9dd36a4b2b79060e7b9df0f35f7524bed" + "content" : "f83031b85723fb74b4f75cfb1838271b667d1a6e2d8cc5d55f5ca6aea273f89d5a8c3ad4e9893e3ce17db235aaec59562e39d14bd4efd8b8f5f86232549da6b5" } ], "licenses" : [ @@ -181,9 +181,9 @@ } } ], - "purl" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar", + "purl" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar", "type" : "library", - "bom-ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar" + "bom-ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar" }, { "publisher" : "FasterXML", @@ -10589,9 +10589,9 @@ ], "dependencies" : [ { - "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-gcp@0.6.6?type=jar", + "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-gcp@0.6.7?type=jar", "dependsOn" : [ - "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar", + "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar", 
"pkg:maven/org.projectlombok/lombok@1.18.42?type=jar", "pkg:maven/com.google.dagger/dagger@2.40.5?type=jar", "pkg:maven/com.google.dagger/dagger-compiler@2.40.5?type=jar", @@ -10604,9 +10604,9 @@ ] }, { - "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.6?type=jar", + "ref" : "pkg:maven/co.worklytics.psoxy/psoxy-core@0.6.7?type=jar", "dependsOn" : [ - "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar", + "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar", "pkg:maven/org.apache.commons/commons-lang3@3.19.0?type=jar", "pkg:maven/org.apache.commons/commons-csv@1.14.1?type=jar", "pkg:maven/commons-io/commons-io@2.18.0?type=jar", @@ -10631,7 +10631,7 @@ ] }, { - "ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.6?type=jar", + "ref" : "pkg:maven/com.avaulta.gateway/gateway-core@0.6.7?type=jar", "dependsOn" : [ "pkg:maven/org.apache.commons/commons-lang3@3.19.0?type=jar", "pkg:maven/org.apache.commons/commons-csv@1.14.1?type=jar", diff --git a/docs/gcp/troubleshooting.md b/docs/gcp/troubleshooting.md index 3b4ab01f2b..9063324e7e 100644 --- a/docs/gcp/troubleshooting.md +++ b/docs/gcp/troubleshooting.md @@ -35,7 +35,7 @@ If you receive an error such as: Error: Error applying IAM policy for cloudfunctions cloudfunction googleapi: Error 400: One or more users named in the policy do not belong to a permitted customer. ``` -This may be due to an [Organization Policy](https://cloud.google.com/resource-manager/docs/organization-policy/overview) that restricts the domains that can be used in IAM policies. See https://cloud.google.com/resource-manager/docs/organization-policy/restricting-domains +This may be due to an [Organization Policy](https://cloud.google.com/resource-manager/docs/organization-policy/overview) that restricts the domains that can be used in IAM policies. 
See [https://cloud.google.com/resource-manager/docs/organization-policy/restricting-domains](https://cloud.google.com/resource-manager/docs/organization-policy/restricting-domains). You may need define an exception for the GCP project in which you're deploying the proxy, or add the domain of your Worklytics Tenant SA to the list of allowed domains. diff --git a/docs/guides/psoxy-test-tool.md b/docs/guides/psoxy-test-tool.md index 56f95a7d7e..69eb922275 100644 --- a/docs/guides/psoxy-test-tool.md +++ b/docs/guides/psoxy-test-tool.md @@ -164,7 +164,7 @@ available options (keep sanitized file in the output bucket, save it to disk, et [signed]: https://docs.aws.amazon.com/general/latest/gr/signing_aws_api_requests.html [Google Calendar]: https://developers.google.com/calendar/api [Zoom]: https://zoom.us -[Zoom API endpoint]: https://marketplace.zoom.us/docs/api-reference/zoom-api/methods/#operation/users +[Zoom API endpoint]: https://developers.zoom.us/docs/api/rest/reference/zoom-api/methods/#operation/users [Google Cloud SDK]: https://cloud.google.com/sdk/gcloud/reference/auth/print-identity-token [authorize gcloud first]: https://cloud.google.com/sdk/gcloud/reference/auth/login [S3]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html diff --git a/docs/guides/terraform-cloud.md b/docs/guides/terraform-cloud.md index 952691537f..3f27e6f4a8 100644 --- a/docs/guides/terraform-cloud.md +++ b/docs/guides/terraform-cloud.md @@ -8,7 +8,7 @@ NOTE: this is tested only for gcp; for aws YMMV, and in particular we expect Mic Prereqs: -- git/java/maven, as described here https://github.com/Worklytics/psoxy#required-software-and-permissions +- git/java/maven, as described here [https://github.com/Worklytics/psoxy#required-software-and-permissions](https://github.com/Worklytics/psoxy#required-software-and-permissions) - for testing, you'll need the CLI of your host environment (eg, AWS CLI, GCloud CLI, Azure CLI) as well as npm/NodeJS installed on your local machine 
After authenticating your terraform CLI to Terraform Cloud/enterprise, you'll need to: @@ -47,7 +47,7 @@ To get them nicely on your local machine, something like the following: ### Terraform API -1. get an API token from your Terraform Cloud or Enterprise instance (eg, https://developer.hashicorp.com/terraform/cloud-docs/users-teams-organizations/api-tokens). +1. get an API token from your Terraform Cloud or Enterprise instance (eg, [https://developer.hashicorp.com/terraform/cloud-docs/users-teams-organizations/api-tokens](https://developer.hashicorp.com/terraform/cloud-docs/users-teams-organizations/api-tokens)). 2. set it as an env variable, as well as the host: diff --git a/docs/guides/upgrading-versions.md b/docs/guides/upgrading-versions.md index dbf3df2c05..dc0300a9a3 100644 --- a/docs/guides/upgrading-versions.md +++ b/docs/guides/upgrading-versions.md @@ -14,7 +14,67 @@ To ease upgrading versions, our example repos ([psoxy-example-aws](https://githu This will update all the versions references throughout your example, and offer you a command to revert if you later wish to do so. A `terraform init` with the appropriate `-upgrade` flag will be run automatically. -After this, you must still run `terraform apply` to apply the changes to your infrastructure. (we recommend `terraform plan` first to preview the changes). +After this, you must still run `terraform apply` to apply the changes to your infrastructure. + +Run `terraform plan` first to preview what will change. To keep a copy for review, redirect the output to a dated file: + +```shell +terraform plan -no-color > "$(date +%Y%m%d-%H%M%S)-upgrade-plan.txt" 2>&1 +``` + +The `./upgrade-terraform-modules` script prints an exact capture command when it finishes. Consider sharing the saved plan with Worklytics support, teammates, or an LLM before you apply. 
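As a first pass over a saved plan file, something like the following sketch can list the resources Terraform intends to destroy or replace. It assumes the plan was captured with `-no-color` as shown above, and it relies on the `must be replaced` and `will be destroyed` markers that Terraform prints in its human-readable plan output. The function name `scan_plan` is hypothetical, and the scan is a triage aid rather than a substitute for reading the plan.

```shell
# scan_plan FILE: print plan lines that announce a destroy or a replacement.
# Hypothetical helper; relies on the text markers Terraform prints in
# human-readable plan output (captured with -no-color).
scan_plan() {
  plan_file="$1"
  if [ ! -r "$plan_file" ]; then
    echo "cannot read plan file: $plan_file" >&2
    return 2
  fi
  # Resource headers look like: "  # aws_s3_bucket.input must be replaced"
  grep -E '^[[:space:]]*# .* (must be replaced|will be destroyed)' "$plan_file"
}
```

Called as `scan_plan 20260618-143022-upgrade-plan.txt`, it prints one line per affected resource; an exit status of 1 means no such lines were found in the file.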
+ +## Reviewing your Terraform plan + +Before running `terraform apply`, review your plan output for changes that are difficult or impossible to undo without operational work outside Terraform. + +### High-risk changes to watch for + +**Rotating or destroying the pseudonymization SALT value/secret** + +Any data processed with the prior SALT value will be inconsistent with data processed after the change (pseudonyms for the same identifier will differ). You must either restore the prior SALT value or re-ingest all affected data to Worklytics. + +In Terraform plans, look for `random_password.pseudonym_salt` (or equivalent secret resources) being destroyed/replaced, or SSM parameters / Secret Manager secrets holding `PSOXY_SALT` being recreated. + +**Replacing Lambda or Cloud Function resources — especially their function URLs** + +Replacing proxy compute changes the endpoint Worklytics calls for API-mode connectors. Update the corresponding connections in Worklytics with the new function URL(s) after apply. + +**Replacing any `-input` buckets** + +Bulk connectors receive files via `-input` buckets. If Terraform replaces these buckets, update any data pipelines, export jobs, or manual upload processes that write files into them. + +**Replacing any `-sanitized` buckets** + +Worklytics reads sanitized bulk output from these buckets. If they are replaced, update the corresponding connections in Worklytics (bucket name, path, and any IAM principal used for access). + +**Replacing parameters/secrets that hold API credentials and are NOT managed by this Terraform configuration** + +Some deployments store third-party API keys in SSM, Secrets Manager, or GCP Secret Manager outside the modules Terraform manages. If a plan destroys or replaces those resources, you must recover the original credential values (from backup or your secrets store) or obtain new credentials from the data source and update both Terraform and the live secret before apply completes. 
+ +**Replacing the IAM role used by Worklytics to invoke cloud functions or read from `-sanitized` buckets** + +Worklytics connections reference the principal that calls your proxy (function URL invoker role, or role/user that reads sanitized buckets). If that role is replaced, update the corresponding connections in Worklytics. + +### Getting help reviewing the plan + +After saving your plan to a file, share it with Worklytics support, a teammate, or an LLM to help scan for the issues above. Example prompt for an LLM: + +```text +Review and summarize the output of terraform plan stored in 20260618-143022-upgrade-plan.txt. + +Flag any high-risk changes, especially: +- destruction or replacement of the pseudonymization SALT/secret +- replacement of Lambda/Cloud Function resources (and their function URLs) +- replacement of any -input buckets +- replacement of any -sanitized buckets +- replacement of parameters/secrets holding API credentials that are NOT managed by this Terraform configuration +- replacement of the IAM role used by Worklytics to invoke cloud functions or read from -sanitized buckets + +For each issue found, explain the operational impact and what I must do before applying this plan. +``` + +Replace the filename with your actual plan file. Do not paste live secrets or credentials into third-party tools; the plan file itself should not contain secret values if Terraform is configured correctly, but review your organization's policies before sharing plan output externally. ## Legacy Deployments (Initial version pre-`v0.4.30`) If you initially used one of our examples prior to `v0.4.30`, or did not use one of our examples, you will need to manually update the version references in your configuration. diff --git a/docs/prereqs-ubuntu.md b/docs/prereqs-ubuntu.md index 3bfc4d10cb..330acbc41c 100644 --- a/docs/prereqs-ubuntu.md +++ b/docs/prereqs-ubuntu.md @@ -67,6 +67,6 @@ gcloud auth application-default login --no-launch-browser 6. 
if using Microsoft 365 data sources, install Azure CLI and authenticate. -https://docs.microsoft.com/en-us/cli/azure/install-azure-cli +[https://docs.microsoft.com/en-us/cli/azure/install-azure-cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) You should now be ready for the general instructions in the [README.md](README.md). diff --git a/docs/sources/README.md b/docs/sources/README.md index 986864eb57..a2f5f31e8a 100644 --- a/docs/sources/README.md +++ b/docs/sources/README.md @@ -12,7 +12,7 @@ To add a source, add its Connector ID to the `enabled_connectors` list in your ` | `atlassian-organization` | [Atlassian Organization](atlassian/organization/README.md) | API | BETA | | `azure-ad` | [Azure Active Directory](microsoft-365/entra-id/README.md) | API | DEPRECATED | | `badge` | [Badge](badge/README.md) | Bulk | GA | -| `chatgpt-enterprise` | [ChatGPT Enterprise](chatgpt-enterprise/README.md) | API | ALPHA | +| `chatgpt-enterprise` | [ChatGPT Enterprise](chatgpt-enterprise/README.md) | API | BETA | | `claude` | [Claude](anthropic/claude/README.md) | API | BETA | | `claude-enterprise-analytics` | [Claude Enterprise Analytics](anthropic/claude-enterprise-analytics/README.md) | API | BETA | | `claude-code` | [Claude Code](anthropic/claude-code/README.md) | API | BETA | @@ -56,11 +56,13 @@ To add a source, add its Connector ID to the `enabled_connectors` list in your ` The following additional bulk connectors are documented but configured via `custom_bulk_connectors` in Terraform rather than `enabled_connectors`: -| Connector ID / key | Data Source | Type | Availability | -|----------------------|----------------------------------------------------------------------|------|--------------| -| `gong-bulk` | [Gong Bulk](gong/gong-bulk/README.md) | Bulk | ALPHA | -| `miro-ai-bulk` | [Miro AI Bulk](miro/miro-ai-bulk/README.md) | Bulk | ALPHA | -| `slack-discovery-bulk` | [Slack Bulk Exports](slack/slack-discovery-bulk/README.md) | Bulk | GA | +| Connector 
ID / key | Data Source | Type | Availability | +|----------------------|-------------------------------------------------------------------------|------|--------------| +| `claude-code-bulk` | [Claude Code Bulk](anthropic/claude-code-bulk/README.md) | Bulk | BETA | +| `gong-bulk` | [Gong Bulk](gong/gong-bulk/README.md) | Bulk | ALPHA | +| `miro-ai-bulk` | [Miro AI Bulk](miro/miro-ai-bulk/README.md) | Bulk | ALPHA | +| `sales-for-copilot` | [Sales for Copilot Bulk](salesforce/sales-for-copilot/README.md) | Bulk | BETA | +| `slack-discovery-bulk` | [Slack Bulk Exports](slack/slack-discovery-bulk/README.md) | Bulk | GA | | `zoom-ai-metrics` | [Zoom AI Metrics Snapshot](zoom/README.md#zoom-ai-metric-snapshot-bulk) | Bulk | ALPHA | From v0.4.58, you can confirm the availability of a connector by running the following command from the root of one of our examples: diff --git a/docs/sources/anthropic/claude-code-bulk/README.md b/docs/sources/anthropic/claude-code-bulk/README.md new file mode 100644 index 0000000000..412c5e79fb --- /dev/null +++ b/docs/sources/anthropic/claude-code-bulk/README.md @@ -0,0 +1,56 @@ +# Claude Code Bulk + +**Connector ID:** `claude-code-bulk` + +**Availability:** Beta + +Psoxy can pseudonymize Claude Code usage events exported as CSV (or NDJSON) bulk files for ingestion into Worklytics. Each row captures one usage event — user, model, and token consumption. + +When this bulk connector is installed, its aggregates take priority over the [Claude Code API connector](../claude-code/README.md) aggregates, so both connectors should not run concurrently. + +## Instructions to Connect + +1. Copy [claude-code-bulk-rules.yaml](claude-code-bulk-rules.yaml) to your Terraform working directory. +2. Add a `claude-code-bulk` entry to your `custom_bulk_connectors` in `terraform.tfvars`, then run `terraform apply`. +3. Review your Terraform output; find the `-input` bucket name for the connector. +4. Export Claude Code usage data from your system as CSV. 
See [claude-code-bulk-sample.csv](claude-code-bulk-sample.csv) for the expected shape. +5. Upload files to the `-input` bucket. Include a date/timestamp in the filename (e.g. `claude-code-bulk-20260101T000000Z.csv`) so repeated uploads are stored as separate objects. +6. Create a **Bulk Import - Psoxy** connection in Worklytics with `claude-code-bulk` as the parser; see the TODO file generated by `terraform apply`. +7. Repeat steps 4–5 periodically to keep data up to date. + +```hcl +custom_bulk_connectors = { + "claude-code-bulk" = { + source_kind = "claude-code" + worklytics_connector_id = "bulk-import-psoxy" + worklytics_connector_name = "Bulk Import - Psoxy" + display_name = "Claude Code Bulk" + rules_file = "claude-code-bulk-rules.yaml" + settings_to_provide = { + "Parser" = "claude-code-bulk" + } + } +} +``` + +## File Schema + +| Field | Type | Required | Description | +|-----------------|---------|----------|-------------| +| `userEmail` | string | yes | Email of the user who triggered the event. Pseudonymized by Psoxy. | +| `timestamp` | integer | yes | Event time as Unix epoch milliseconds. | +| `ai_tool` | string | no | AI tool identifier (e.g. `claude-code`). | +| `model` | string | no | Underlying model name (e.g. `claude-sonnet-4-6`). | +| `input_tokens` | integer | no | Input/prompt tokens consumed. | +| `output_tokens` | integer | no | Output/completion tokens generated. | +| `event_type` | string | no | Type of event, if available (open enum, e.g. `autocomplete`). | + +## Sanitization + +Only `userEmail` contains PII. All other fields are passed through unchanged. + +See [claude-code-bulk-rules.yaml](claude-code-bulk-rules.yaml) for the full rule set. 
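A minimal sketch of how an export script might produce a file in the shape described by the File Schema table, assuming usage events are already available as Python dictionaries. The helper names (`to_epoch_millis`, `render_csv`, `bulk_filename`) are hypothetical and not part of Psoxy; only the column names and the filename pattern come from this README.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Column order taken from the File Schema table in this README.
COLUMNS = ["userEmail", "timestamp", "ai_tool", "model",
           "input_tokens", "output_tokens", "event_type"]

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to Unix epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("use timezone-aware datetimes to avoid local-time ambiguity")
    # Integer division of timedeltas avoids float rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def render_csv(events) -> str:
    """Render events (dicts keyed by column name) as CSV text with a header row."""
    buffer = io.StringIO()
    # restval="" leaves optional columns empty when an event omits them.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writeheader()
    for event in events:
        writer.writerow(event)
    return buffer.getvalue()


def bulk_filename(exported_at: datetime) -> str:
    """Build a timestamped object name so repeated uploads stay separate."""
    stamp = exported_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"claude-code-bulk-{stamp}.csv"
```

Uploading the rendered file to the `-input` bucket is left to whatever tooling your cloud provider offers; this sketch only covers the file contents and name.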
+ +## Example Data + +- [claude-code-bulk-sample.csv](claude-code-bulk-sample.csv) diff --git a/docs/sources/anthropic/claude-code-bulk/claude-code-bulk-rules.yaml b/docs/sources/anthropic/claude-code-bulk/claude-code-bulk-rules.yaml new file mode 100644 index 0000000000..5e4839490d --- /dev/null +++ b/docs/sources/anthropic/claude-code-bulk/claude-code-bulk-rules.yaml @@ -0,0 +1,3 @@ +format: AUTO +transforms: + - pseudonymize: "$.userEmail" diff --git a/docs/sources/anthropic/claude-code-bulk/claude-code-bulk-sample.csv b/docs/sources/anthropic/claude-code-bulk/claude-code-bulk-sample.csv new file mode 100644 index 0000000000..559eef4af7 --- /dev/null +++ b/docs/sources/anthropic/claude-code-bulk/claude-code-bulk-sample.csv @@ -0,0 +1,8 @@ +userEmail,timestamp,ai_tool,model,input_tokens,output_tokens,event_type +alice@acme-corp.com,1753344000000,claude-code,claude-sonnet-4-6,1500,320, +alice@acme-corp.com,1753344900000,claude-code,claude-sonnet-4-6,820,195,autocomplete +alice@acme-corp.com,1753430400000,claude-code,claude-opus-4-8,3200,740, +bob@acme-corp.com,1753344060000,claude-code,claude-sonnet-4-6,800,150, +bob@acme-corp.com,1753347600000,claude-code,claude-sonnet-4-6,1100,280,autocomplete +carol@acme-corp.com,1753430460000,claude-code,claude-sonnet-4-6,200,90, +carol@acme-corp.com,1753516800000,claude-code,claude-opus-4-8,4100,980, diff --git a/docs/sources/atlassian/confluence/README.md b/docs/sources/atlassian/confluence/README.md index 2bcd5744aa..4e474cd3d9 100644 --- a/docs/sources/atlassian/confluence/README.md +++ b/docs/sources/atlassian/confluence/README.md @@ -89,7 +89,7 @@ it will print the all the values to complete the configuration: curl --request POST --url 'https://auth.atlassian.com/oauth/token' --header 'Content-Type: application/json' --data '{"grant_type": "authorization_code","client_id": "YOUR_CLIENT_ID","client_secret": "YOUR_CLIENT_SECRET", "code": "YOUR_AUTHENTICATION_CODE", "redirect_uri": "http://localhost"}'` ``` 5. 
After running that command, if successful you will see a - [JSON response](https://developer.atlassian.com/cloud/confluence/platform/oauth-2-3lo-apps/#2--exchange-authorization-code-for-access-token) like this: + [JSON response](https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/#2--exchange-authorization-code-for-access-token) like this: ```json { "access_token": "some short live access token", diff --git a/docs/sources/chatgpt-enterprise/README.md b/docs/sources/chatgpt-enterprise/README.md index e828bd70e8..15eb2a8474 100644 --- a/docs/sources/chatgpt-enterprise/README.md +++ b/docs/sources/chatgpt-enterprise/README.md @@ -2,7 +2,7 @@ **Connector ID:** `chatgpt-enterprise` -**Availability:** Alpha +**Availability:** Beta ## ChatGPT Enterprise via Compliance API diff --git a/docs/sources/gong/gong-bulk/README.md b/docs/sources/gong/gong-bulk/README.md index 5f85b73e40..484b0c466a 100644 --- a/docs/sources/gong/gong-bulk/README.md +++ b/docs/sources/gong/gong-bulk/README.md @@ -6,7 +6,7 @@ Psoxy can pseudonymize Gong bulk export data for ingestion into Worklytics. -See [https://docs.worklytics.co/knowledge-base/connectors/bulk-data/gong-bulk](https://docs.worklytics.co/knowledge-base/connectors/bulk-data/gong-bulk) +See [https://docs.worklytics.co/psoxy/sources/gong/gong-bulk](https://docs.worklytics.co/psoxy/sources/gong/gong-bulk) Data is exported from Gong's [Forecast and Gong User Tables](https://help.gong.io/docs/forecast-and-gong-user-tables) feature as CSV files that must be uploaded periodically to the proxy input bucket. 
diff --git a/docs/sources/google-workspace/README.md b/docs/sources/google-workspace/README.md index 90251fe2e1..2cc3e507af 100644 --- a/docs/sources/google-workspace/README.md +++ b/docs/sources/google-workspace/README.md @@ -20,6 +20,19 @@ Within those, the `google-workspace.tf` and `google-workspace-variables.tf` file - [google-chat](google-chat/README.md) (Google Chat™) - [meet](meet/README.md) (Google Meet™) +OAuth scopes omit the `https://www.googleapis.com/auth/` prefix. See [OAuth 2.0 Scopes for Google APIs](https://developers.google.com/identity/protocols/oauth2/scopes). Definitive values are defined in [`google-workspace.tf`](../../../infra/modules/worklytics-connector-specs/google-workspace.tf). + +| Connector | Connector ID | OAuth Scopes | +|-----------|--------------|--------------| +| [calendar](calendar/README.md) | `gcal` | `calendar.readonly` | +| [google-chat](google-chat/README.md) | `google-chat` | `admin.reports.audit.readonly` | +| [directory](directory/README.md) | `gdirectory` | `admin.directory.user.readonly` `admin.directory.domain.readonly` `admin.directory.group.readonly` `admin.directory.orgunit.readonly` | +| [gdrive](gdrive/README.md) | `gdrive` | `drive.metadata.readonly` | +| [gmail](gmail/README.md) | `gmail` | `gmail.metadata` | +| [meet](meet/README.md) | `google-meet` | `admin.reports.audit.readonly` | +| [gemini-in-workspace-apps](gemini-in-workspace-apps/README.md) | `gemini-in-workspace-apps` | `admin.reports.audit.readonly` | +| [gemini-usage-bulk](gemini-usage-bulk/README.md) | `gemini-usage` | n/a (bulk CSV upload) | + ## Required Permissions You (the user running Terraform) must have the following roles (or some of the permissions within them) in the GCP project in which you will provision the OAuth clients that will be used to connect to your Google Workspace™ data: @@ -38,24 +51,23 @@ Additionally, a Google Workspace™ Admin will need to make a Domain-wide De We also recommend you create a dedicated Google Workspace™ 
user for Psoxy to use when connecting to your Google Workspace™ Admin API, with the specific permissions needed. This avoids the connection being tied to a personal account and helps with auditing and security. -This is not to be confused with a GCP Service Account. Rather, this is a regular Google Workspace™ user account, but intended to be assigned to a service rather than a human user. Your proxy instance will impersonate this user when accessing the [Google Admin Directory](https://developers.google.com/admin-sdk/directory/v1/guides) and [Reports](https://developers.google.com/admin-sdk/reports/v1/guides) APIs. (Google requires thatthese be accessed via impersonation of a Google user account, rather than directly using a GCP service account). +This is not to be confused with a GCP Service Account. Rather, this is a regular Google Workspace™ user account, but intended to be assigned to a service rather than a human user. Your proxy instance will impersonate this user when accessing the [Google Admin Directory](https://developers.google.com/admin-sdk/directory/v1/guides) and [Reports](https://developers.google.com/workspace/admin/reports) APIs. (Google requires that these be accessed via impersonation of a Google user account, rather than directly using a GCP service account). We recommend naming the account `svc-worklytics@{your-domain.com}`. If you have already created a sufficiently privileged service account user for a different Google Workspace™ connection, you can re-use that one. -Assign the account a sufficiently privileged role. At minimum, the role must have the following privileges, _read-only_: - -- Admin API -- Domain Settings -- Groups -- Organizational Units -- Reports (required only if you are connecting to the Audit Logs, used for Google Chat™, Google Meet™, etc) -- Users +Assign the account a sufficiently privileged role. 
At minimum, the role must grant _read-only_ access to the following [Administrator privileges](https://knowledge.workspace.google.com/admin/users/administrator-privilege-definitions) (expand each category in the Custom Role editor and enable only the **Read** sub-action, rather than checking the parent checkbox): -Those refer to [Google's documentation](https://support.google.com/a/answer/1219251?fl=1&sjid=8026519161455224599-NA), as shown below (as of Aug 2023); you can refer there for more details about these privileges. +| Privilege | Required? | Purpose | +| --------- | --------- | ------- | +| **Users** → Read | Yes | Directory user data | +| **Groups** → Read | Yes | Directory group membership | +| **Organizational Units** → Read | Optional | Org-unit segmentation | +| **Domain Management** | Optional | List of internal domains | +| **Reports** | Only if using [Google Chat](google-chat/README.md), [Google Meet](meet/README.md), or other audit-log connectors | Audit / usage reports | -![google-workspace-admin-privileges.png](google-workspace-admin-privileges.png) +All of the above are found under **Admin settings privileges** in the Custom Role editor. Google reorganized administrator privileges in 2025; expand each category and enable only the **Read** sub-action where available. See Google's [privilege definitions](https://knowledge.workspace.google.com/admin/users/administrator-privilege-definitions) for the full list. The email address of the account you created will be used when creating the data connection to the Google Directory in the Worklytics™ portal. Provide it as the value of the 'Google Account to Use for Connection' setting when they create the connection. @@ -63,18 +75,18 @@ The email address of the account you created will be used when creating the data If you choose not to use a predefined role that covers the above, you can define a [Custom Role](https://support.google.com/a/answer/2406043?fl=1). 
-Using a Custom Role, with 'Read' access to each of the required Admin API privileges is good practice, but least-privilege is also enforced in TWO additional ways: +Using a Custom Role with read-only access to each required privilege is good practice, but least-privilege is also enforced in TWO additional ways: - the Proxy API rules restrict the API endpoints that Worklytics™ can access, as well as the HTTP methods that may be used. This enforces read-only access, limited to the required data types (and actually even more granular than what Workspace Admin privileges and OAuth scopes support). - the OAuth scopes granted to the API client via Domain-wide Delegation. Each OAuth Client used by Worklytics™ is granted only read-only scopes, least-permissive for the data types required, e.g. `https://www.googleapis.com/auth/admin.directory.user.readonly`. So a least-privileged custom role is essentially a 3rd layer of enforcement. -In the Google Workspace™ Admin Console as of August 2023, creating a 'Custom Role' for this user will look something like the following: +An example least-privilege Custom Role for the Directory connector: -![custom-role.png](custom-role.png) +![custom-role-least-privilege.png](custom-role-least-privilege.png) -**YMMV** - Google's UI changes frequently and varies by Google Workspace™ edition, so you may see more or fewer options than shown above. Please scroll the list of privileges to ensure you grant READ access to API for all required data. +**YMMV** - Google's UI changes frequently and varies by Google Workspace™ edition, so you may see more or fewer options than shown above. Scroll the privilege list and enable only the **Read** sub-actions required for your connectors.
## General Authentication Overview @@ -84,7 +96,7 @@ When the proxy connects to Google, it first authenticates with Google API using The service account key can be rotated at any time, and the Terraform configuration examples we provide can be configured to do this for you if applied regularly. -More information: https://developers.google.com/workspace/guides/auth-overview +More information: [https://developers.google.com/workspace/guides/auth-overview](https://developers.google.com/workspace/guides/auth-overview) To initially authorize each connector, a sufficiently privileged Google Workspace™ Admin must make a Domain-wide Delegation grant to the OAuth Client you create, by pasting its numeric ID and a CSV of the required OAuth scopes into the Google Workspace™ Admin console. This is a one-time setup step. @@ -108,6 +120,18 @@ While not recommended, it is possible to set up Google API clients without Terra Then follow the steps in the next section to create the keys for the OAuth Clients. +If your organization's policies don't allow Terraform to manage some or all of these GCP resources, you can still use our Terraform modules for the rest of your deployment and disable the parts you must do manually via `google_workspace_connector_settings` in your `terraform.tfvars`: + +```hcl +google_workspace_connector_settings = { + enable_apis = false + provision_service_accounts = false + provision_keys = false +} +``` + +When any of these is `false`, Terraform will skip creating the corresponding resources and instead emit TODO files (or `todos_1` outputs, if configured) with instructions to complete those steps outside of Terraform. + NOTE: if you are creating connections to multiple Google Workspace™ sources, you can use a single OAuth client and share it between all the proxy instances. You just need to authorize the entire superset of OAuth scopes required by those connections for the OAuth Client via the Google Workspace™ Admin console.
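As a rough illustration of assembling that superset, the sketch below expands the short scope names from the OAuth scopes table in this README into full URLs and joins them into the comma-separated value pasted into the Admin console. The mapping is copied by hand from that table and the function name `delegation_scopes_csv` is hypothetical; the definitive scope values remain those in `google-workspace.tf`.

```python
SCOPE_PREFIX = "https://www.googleapis.com/auth/"

# Short scope names per connector ID, copied from the scopes table in this README.
CONNECTOR_SCOPES = {
    "gcal": ["calendar.readonly"],
    "google-chat": ["admin.reports.audit.readonly"],
    "gdirectory": [
        "admin.directory.user.readonly",
        "admin.directory.domain.readonly",
        "admin.directory.group.readonly",
        "admin.directory.orgunit.readonly",
    ],
    "gdrive": ["drive.metadata.readonly"],
    "gmail": ["gmail.metadata"],
    "google-meet": ["admin.reports.audit.readonly"],
    "gemini-in-workspace-apps": ["admin.reports.audit.readonly"],
}


def delegation_scopes_csv(connector_ids):
    """Return a comma-separated list of full scope URLs, without duplicates."""
    scopes = []
    for connector_id in connector_ids:
        for short_name in CONNECTOR_SCOPES[connector_id]:
            full_scope = SCOPE_PREFIX + short_name
            # Several connectors share the audit-log scope; list it only once.
            if full_scope not in scopes:
                scopes.append(full_scope)
    return ",".join(scopes)


if __name__ == "__main__":
    # Example: one shared OAuth client serving the Calendar and Directory connectors.
    print(delegation_scopes_csv(["gcal", "gdirectory"]))
```

An unknown connector ID raises `KeyError`, which is a deliberate choice here: silently skipping a connector would produce a grant that is missing scopes.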
### Provisioning API Keys without Terraform @@ -115,9 +139,13 @@ NOTE: if you are creating connections to multiple Google Workspace™ source If your organization's policies don't allow GCP service account keys to be managed via Terraform (or you lack the perms to do so), you can still use our Terraform modules to create the clients, and just add the following to your `terraform.tfvars` to disable provisioning of the keys: ```hcl -google_workspace_provision_keys = false +google_workspace_connector_settings = { + provision_keys = false +} ``` +The deprecated top-level variable `google_workspace_provision_keys` is still supported, but the map form above is preferred. + Then you can create the keys manually, and store them in your secrets manager of choice. For each API client you need to: diff --git a/docs/sources/google-workspace/calendar/README.md b/docs/sources/google-workspace/calendar/README.md index 83cc3ab545..46fefddd76 100644 --- a/docs/sources/google-workspace/calendar/README.md +++ b/docs/sources/google-workspace/calendar/README.md @@ -5,7 +5,10 @@ **Availability:** GA Please review the [Google Workspace™ README](../README.md) for general information applicable to -all Google Workspace&trade connectors. +all Google Workspace™ connectors. 
+ +## Required OAuth Scopes +- `calendar.readonly` ## Examples diff --git a/docs/sources/google-workspace/custom-role-least-privilege.png b/docs/sources/google-workspace/custom-role-least-privilege.png new file mode 100644 index 0000000000..3c1a083b0a Binary files /dev/null and b/docs/sources/google-workspace/custom-role-least-privilege.png differ diff --git a/docs/sources/google-workspace/custom-role.png b/docs/sources/google-workspace/custom-role.png deleted file mode 100644 index 430609b1f9..0000000000 Binary files a/docs/sources/google-workspace/custom-role.png and /dev/null differ diff --git a/docs/sources/google-workspace/directory/README.md b/docs/sources/google-workspace/directory/README.md index f500971c3e..5e2d334d60 100644 --- a/docs/sources/google-workspace/directory/README.md +++ b/docs/sources/google-workspace/directory/README.md @@ -7,6 +7,12 @@ Please review the [Google Workspace™ README](../README.md) for general information applicable to all Google Workspace connectors. +## Required OAuth Scopes +- `admin.directory.user.readonly` +- `admin.directory.domain.readonly` +- `admin.directory.group.readonly` +- `admin.directory.orgunit.readonly` + ## Examples diff --git a/docs/sources/google-workspace/example-api-calls.md b/docs/sources/google-workspace/example-api-calls.md deleted file mode 100644 index 1c90d84be4..0000000000 --- a/docs/sources/google-workspace/example-api-calls.md +++ /dev/null @@ -1,251 +0,0 @@ -# API Call Examples for Google Workspace - -Example commands (\*) that you can use to validate proxy behavior against the Google Workspace APIs. -Follow the steps and change the values to match your configuration when needed. - -You can use the `-i` flag to impersonate the desired user identity option when running the testing -tool. 
Example: - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/calendars/primary -i you@acme.com -``` - -For AWS, change the role to assume with one with sufficient permissions to call the proxy (`-r` -flag). Example: - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/calendars/primary -r arn:aws:iam::PROJECT_ID:role/ROLE_NAME -``` - -If any call appears to fail, repeat it using the `-v` flag. - -(\*) All commands assume that you are at the root path of the Psoxy project. - -### Calendar - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/calendars/primary -``` - -### Settings - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/users/me/settings -``` - -### Events - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/calendars/primary/events -``` - -### Event - -1. Get the calendar event ID (accessor path in response `.items[0].id`): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/calendars/primary/events -``` - -2. Get event information (replace `calendar_event_id` with the corresponding value): - -``` -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gcal/calendar/v3/calendars/primary/events/[calendar_event_id] -``` - -## Directory - -### Domains - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/customer/my_customer/domains -``` - -### Groups - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/groups?customer=my_customer -``` - -### Group - -1. Get the group ID (accessor path in response `.groups[0].id`): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/groups?customer=my_customer -``` - -2. 
Get group information (replace `google_group_id` with the corresponding value): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/groups/[google_group_id] -``` - -### Group Members - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/groups/[google_group_id]/members -``` - -### Users - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/users?customer=my_customer -``` - -1. Get the user ID (accessor path in response `.users[0].id`): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/users?customer=my_customer -``` - -2. Get user information (replace [google_user_id] with the corresponding value): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/users/[google_user_id] -``` - -3. Thumbnail (expect have its contents redacted; replace [google_user_id] with the corresponding - value): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/users/[google_user_id]/photos/thumbnail -``` - -### Roles - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/customer/my_customer/roles -``` - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdirectory/admin/directory/v1/customer/my_customer/roleassignments -``` - -## Drive - -### Files - -API v2 - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files -``` - -API v3 (\*) - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v3/files -``` - -(\*) Notice that only the "version" part of the URL changes, and all subsequent calls should work -for `v2` and also `v3`. - -### File - -1. 
Get the file ID (accessor path in response `.files[0].id`: - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files -``` - -2. Get file details (replace [drive_file_id] with the corresponding value): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]?fields=* -``` - -### File Revisions - -YMMV, as file at index `0` must actually be a type that supports revisions for this to return -anything. You can play with different file IDs until you find something that does. - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/revisions -``` - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/revisions?pageSize=2&fields=* -``` - -### Permissions - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/permissions -``` - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/permissions?fields=* -``` - -### Comments - -YMMV, as file at index `0` must actually be a type that has comments for this to return anything. -You can play with different file IDs until you find something that does. - -**NOTE probably blocked by OAuth metadata only scope!!** - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/comments -``` - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/comments?fields=* -``` - -### Comment - -**NOTE probably blocked by OAuth metadata only scope!!** - -1. Get file comment ID (accessor path in response `.items[0].id`): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/comments -``` - -2. 
Get file comment details (replace `file_comment_id` with the corresponding value): - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/comments/[file_comment_id] -``` - -### Replies - -**NOTE probably blocked by OAuth metadata only scope!!** - -YMMV, as above, play with the file comment ID value until you find a file with comments, and a -comment that has replies. - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gdrive/drive/v2/files/[drive_file_id]/comments/[file_comment_id]/replies -``` - -## GMail - -### Messages - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gmail/gmail/v1/users/me/messages -``` - -### Message - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-gmail/gmail/v1/users/me/messages/[gmail_message_id]?format=metadata -``` - -## Google Chat - -NOTE: limited to 10 results, to keep it readable. - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-google-chat/admin/reports/v1/activity/users/all/applications/chat?maxResults=10 -``` - -## Google Meet - -NOTE: limited to 10 results, to keep it readable. - -```shell -node tools/psoxy-test/cli-call.js -u [your_psoxy_url]/psoxy-google-chat/admin/reports/v1/activity/users/all/applications/meet?maxResults=10 -``` diff --git a/docs/sources/google-workspace/gdrive/README.md b/docs/sources/google-workspace/gdrive/README.md index 6058699db8..269f7cafdc 100644 --- a/docs/sources/google-workspace/gdrive/README.md +++ b/docs/sources/google-workspace/gdrive/README.md @@ -7,6 +7,9 @@ Please review the [Google Workspace™ README](../README.md) for general information applicable to all Google Workspace connectors. 
+## Required OAuth Scopes +- `drive.metadata.readonly` + ## Examples diff --git a/docs/sources/google-workspace/gemini-in-workspace-apps/README.md b/docs/sources/google-workspace/gemini-in-workspace-apps/README.md index 752024bfaa..5cde8b326b 100644 --- a/docs/sources/google-workspace/gemini-in-workspace-apps/README.md +++ b/docs/sources/google-workspace/gemini-in-workspace-apps/README.md @@ -9,6 +9,9 @@ all Google Workspace connectors. This connector pulls Gemini-events from the Google Workspace audit log. +## Required OAuth Scopes +- `admin.reports.audit.readonly` + ## Examples diff --git a/docs/sources/google-workspace/gemini-usage-bulk/README.md b/docs/sources/google-workspace/gemini-usage-bulk/README.md index fec37eede5..ccb6357fe7 100644 --- a/docs/sources/google-workspace/gemini-usage-bulk/README.md +++ b/docs/sources/google-workspace/gemini-usage-bulk/README.md @@ -4,13 +4,17 @@ **Availability:** Deprecated +## Required OAuth Scopes + +None. This is a bulk connector; reports are downloaded from the Google Workspace Admin Console and uploaded to the proxy rather than retrieved via OAuth-scoped APIs. + Worklytics™ supports the import of Gemini™ Usage reports to analyze AI adoption in your organization. As of Feb 2025, these reports must be downloaded periodically by a sufficiently privileged user from the Google Workspace™ Admin Console. The reports cover ~4 weeks of history; we recommend downloading them at least weekly to provide granular insights into AI adoption. More information: -https://support.google.com/a/answer/14564320 +[https://support.google.com/a/answer/14564320](https://support.google.com/a/answer/14564320) The CSV report file must then be uploaded to a proxy `-input` bucket for your connector, which will then be processed by the pseudonymization proxy to prepare it for import to Worklytics™. 
diff --git a/docs/sources/google-workspace/gmail/README.md b/docs/sources/google-workspace/gmail/README.md index a640f0c4ed..9d1169d3e8 100644 --- a/docs/sources/google-workspace/gmail/README.md +++ b/docs/sources/google-workspace/gmail/README.md @@ -7,6 +7,9 @@ Please review the [Google Workspace™ README](../README.md) for general information applicable to all Google Workspace connectors. +## Required OAuth Scopes +- `gmail.metadata` + ## Examples - [Example Rules](gmail.yaml) diff --git a/docs/sources/google-workspace/google-chat/README.md b/docs/sources/google-workspace/google-chat/README.md index 2a9441aa27..422d072f39 100644 --- a/docs/sources/google-workspace/google-chat/README.md +++ b/docs/sources/google-workspace/google-chat/README.md @@ -7,6 +7,9 @@ Please review the [Google Workspace™ README](../README.md) for general information applicable to all Google Workspace connectors. +## Required OAuth Scopes +- `admin.reports.audit.readonly` + ## Examples diff --git a/docs/sources/google-workspace/google-workspace-admin-privileges.png b/docs/sources/google-workspace/google-workspace-admin-privileges.png deleted file mode 100644 index 66722094d8..0000000000 Binary files a/docs/sources/google-workspace/google-workspace-admin-privileges.png and /dev/null differ diff --git a/docs/sources/google-workspace/meet/README.md b/docs/sources/google-workspace/meet/README.md index dc12e4c564..0e98952885 100644 --- a/docs/sources/google-workspace/meet/README.md +++ b/docs/sources/google-workspace/meet/README.md @@ -7,6 +7,9 @@ Please review the [Google Workspace™ README](../README.md) for general information applicable to all Google Workspace connectors. 
+## Required OAuth Scopes +- `admin.reports.audit.readonly` + ## Examples diff --git a/docs/sources/microsoft-365/README.md b/docs/sources/microsoft-365/README.md index 78a34b04c3..a06f395c24 100644 --- a/docs/sources/microsoft-365/README.md +++ b/docs/sources/microsoft-365/README.md @@ -89,7 +89,7 @@ For example, pilot/PoC deployments typically use only the `Calendar` connector i | Source                 | Examples    | Application Scopes | |--------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Entra ID | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/directory/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/directory/directory.yaml) | [`User.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#userreadall) [`Group.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#groupreadall) 
[`MailboxSettings.Read`](https://learn.microsoft.com/en-us/graph/permissions-reference#mailboxsettingsread) | +| Entra ID | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/entra-id/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/entra-id/entra-id.yaml) | [`User.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#userreadall) [`Group.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#groupreadall) [`MailboxSettings.Read`](https://learn.microsoft.com/en-us/graph/permissions-reference#mailboxsettingsread) | | Calendar | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/outlook-cal/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/outlook-cal/outlook-cal.yaml) | [`User.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#userreadall) [`Group.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#groupreadall) [`Calendars.Read`](https://learn.microsoft.com/en-us/graph/permissions-reference#calendarsread) [`MailboxSettings.Read`](https://learn.microsoft.com/en-us/graph/permissions-reference#mailboxsettingsread) | | Mail | [data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/outlook-mail/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/outlook-mail/outlook-mail.yaml) | [`User.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#userreadall) [`Group.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#groupreadall) [`Mail.ReadBasic.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#mailreadbasicall) [`MailboxSettings.Read`](https://learn.microsoft.com/en-us/graph/permissions-reference#mailboxsettingsread) | | Teams | 
[data](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/msft-teams/example-api-responses) - [rules](https://github.com/Worklytics/psoxy/tree/main/docs/sources/microsoft-365/msft-teams/msft-teams.yaml) | [`User.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#userreadall) [`Team.ReadBasic.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#teamreadbasicall) [`Channel.ReadBasic.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#channelreadbasicall) [`Chat.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#chatreadall) [`ChannelMessage.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#channelmessagereadall) [`CallRecords.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#callrecordsreadall) [`OnlineMeetings.Read.All`](https://learn.microsoft.com/en-us/graph/permissions-reference#onlinemeetingsreadall) | @@ -121,7 +121,7 @@ If you lack the `Cloud Application Administrator` role, you can ask someone in y Then you obtain the `Object ID` of the Entra ID application you created, and set it as the value of `msft_connector_app_object_id` in your `terraform.tfvars` file. See: -https://github.com/Worklytics/psoxy-example-aws/blob/main/msft-365-variables.tf +[https://github.com/Worklytics/psoxy-example-aws/blob/main/msft-365-variables.tf](https://github.com/Worklytics/psoxy-example-aws/blob/main/msft-365-variables.tf) ### Configure Workload Identity Federation (OIDC) Authentication via Entra ID @@ -173,7 +173,7 @@ the settings necessary for the proxy to connect to Microsoft Graph API. After th still need a Microsoft 365 admin to perform the admin consent step for each application. 
See -https://registry.terraform.io/providers/hashicorp/azuread/latest/docs/resources/application#import +[https://registry.terraform.io/providers/hashicorp/azuread/latest/docs/resources/application#import](https://registry.terraform.io/providers/hashicorp/azuread/latest/docs/resources/application#import) for details. diff --git a/docs/sources/microsoft-365/example-api-calls.md b/docs/sources/microsoft-365/example-api-calls.md deleted file mode 100644 index 1f2671e935..0000000000 --- a/docs/sources/microsoft-365/example-api-calls.md +++ /dev/null @@ -1,15 +0,0 @@ -# API Call Examples - -Example test commands that you can use to validate proxy behavior against various source APIs. - - -## Mail - -Assuming proxy is auth'd as an application, you'll have to replace `me` with your MSFT ID or -UserPrincipalName (often your email address). - -``` -/v1.0/users/me/mailFolders/SentItems/messages -/v1.0/users/me/messages/{messageId} -/v1.0/users/me/mailboxSettings -``` diff --git a/docs/sources/microsoft-365/msft-teams/README.md b/docs/sources/microsoft-365/msft-teams/README.md index c1a25ccc0d..35067e66ab 100644 --- a/docs/sources/microsoft-365/msft-teams/README.md +++ b/docs/sources/microsoft-365/msft-teams/README.md @@ -33,7 +33,7 @@ Please follow the steps below: 1. Ensure the user you are going to use for running the commands has the "Teams Administrator" role. You can add the role in the [Microsoft 365 Admin Center](https://learn.microsoft.com/en-us/microsoft-365/admin/add-users/assign-admin-roles?view=o365-worldwide#assign-a-user-to-an-admin-role-from-active-users) -**NOTE**: It can be assigned through Entra Id portal in Azure portal OR in Entra Admin center https://admin.microsoft.com/AdminPortal/Home. 
It is possible that even login with an admin account in Entra Admin Center the Teams role is not available to assign to any user; if so, please do it through Azure Portal (Entra Id -> Users -> Assign roles) +**NOTE**: It can be assigned through the Entra ID portal in the Azure portal or in [https://admin.microsoft.com/AdminPortal/Home](https://admin.microsoft.com/AdminPortal/Home). Even when logged in with an admin account, the Entra Admin Center may not offer the Teams role for assignment to any user; if so, assign it through the Azure Portal (Entra ID -> Users -> Assign roles) 2. Install [PowerShell Teams](https://learn.microsoft.com/en-us/microsoftteams/teams-powershell-install) module. 3. Run the following commands in Powershell terminal: diff --git a/docs/sources/miro/miro-ai-bulk/README.md b/docs/sources/miro/miro-ai-bulk/README.md index fee594c67e..e35f156f41 100644 --- a/docs/sources/miro/miro-ai-bulk/README.md +++ b/docs/sources/miro/miro-ai-bulk/README.md @@ -6,7 +6,7 @@ Psoxy can pseudonymize Miro AI from [Miro Audit Log CSV](https://help.miro.com/hc/en-us/articles/360017571434-Audit-logs#h_01J7EY4E0F67EFTRQ7BT688HW0) data for ingestion into Worklytics. -See [https://docs.worklytics.co/knowledge-base/connectors/bulk-data/miro-ai-bulk](https://docs.worklytics.co/knowledge-base/connectors/bulk-data/miro-ai-bulk) +See [https://docs.worklytics.co/psoxy/sources/miro/miro-ai-bulk](https://docs.worklytics.co/psoxy/sources/miro/miro-ai-bulk) The default proxy rules for `miro-ai-bulk` will pseudonymize `Actor` and `Team Name`. Fields like `IP Address`, `Actor Name` and `Affected Object` will be redacted. 
If your data set does not match diff --git a/docs/sources/salesforce/sales-for-copilot/README.md b/docs/sources/salesforce/sales-for-copilot/README.md new file mode 100644 index 0000000000..af764753ee --- /dev/null +++ b/docs/sources/salesforce/sales-for-copilot/README.md @@ -0,0 +1,54 @@ +# Salesforce Copilot for Sales Bulk + +**Connector ID:** `sales-for-copilot` + +**Availability:** Beta + +Psoxy can pseudonymize Salesforce Copilot for Sales usage events exported as CSV (or NDJSON) bulk files for ingestion into Worklytics. Each row captures one usage event — user, model, and token consumption. + +## Instructions to Connect + +1. Copy [sales-for-copilot-rules.yaml](sales-for-copilot-rules.yaml) to your Terraform working directory. +2. Add a `sales-for-copilot` entry to your `custom_bulk_connectors` in `terraform.tfvars`, then run `terraform apply`. +3. Review your Terraform output; find the `-input` bucket name for the connector. +4. Export Copilot for Sales usage data from your system as CSV. See [sales-for-copilot-sample.csv](sales-for-copilot-sample.csv) for the expected shape. +5. Upload files to the `-input` bucket. Include a date/timestamp in the filename (e.g. `sales-for-copilot-20260101T000000Z.csv`) so repeated uploads are stored as separate objects. +6. Create a **Bulk Import - Psoxy** connection in Worklytics with `sales-for-copilot-bulk` as the parser; see the TODO file generated by `terraform apply`. +7. Repeat steps 4–5 periodically to keep data up to date. 
+ +```hcl +custom_bulk_connectors = { + "sales-for-copilot" = { + source_kind = "sales-for-copilot" + worklytics_connector_id = "bulk-import-psoxy" + worklytics_connector_name = "Bulk Import - Psoxy" + display_name = "Sales for Copilot Bulk" + rules_file = "sales-for-copilot-rules.yaml" + settings_to_provide = { + "Parser" = "sales-for-copilot-bulk" + } + } +} +``` + +## File Schema + +| Field | Type | Required | Description | +|-----------------|---------|----------|-------------| +| `userEmail` | string | yes | Email of the user who triggered the event. Pseudonymized by Psoxy. | +| `timestamp` | integer | yes | Event time as Unix epoch milliseconds. | +| `ai_tool` | string | no | AI tool identifier (e.g. `copilot`). | +| `model` | string | no | Underlying model name (e.g. `gpt-4o`). | +| `input_tokens` | integer | no | Input/prompt tokens consumed. | +| `output_tokens` | integer | no | Output/completion tokens generated. | +| `event_type` | string | no | Type of event, if available (open enum, e.g. `email_draft`). | + +## Sanitization + +Only `userEmail` contains PII. All other fields are passed through unchanged. + +See [sales-for-copilot-rules.yaml](sales-for-copilot-rules.yaml) for the full rule set. 
+ +## Example Data + +- [sales-for-copilot-sample.csv](sales-for-copilot-sample.csv) diff --git a/docs/sources/salesforce/sales-for-copilot/sales-for-copilot-rules.yaml b/docs/sources/salesforce/sales-for-copilot/sales-for-copilot-rules.yaml new file mode 100644 index 0000000000..5e4839490d --- /dev/null +++ b/docs/sources/salesforce/sales-for-copilot/sales-for-copilot-rules.yaml @@ -0,0 +1,3 @@ +format: AUTO +transforms: + - pseudonymize: "$.userEmail" diff --git a/docs/sources/salesforce/sales-for-copilot/sales-for-copilot-sample.csv b/docs/sources/salesforce/sales-for-copilot/sales-for-copilot-sample.csv new file mode 100644 index 0000000000..09d76f5306 --- /dev/null +++ b/docs/sources/salesforce/sales-for-copilot/sales-for-copilot-sample.csv @@ -0,0 +1,8 @@ +userEmail,timestamp,ai_tool,model,input_tokens,output_tokens,event_type +alice@acme-corp.com,1753344000000,copilot,gpt-4o,1200,280, +alice@acme-corp.com,1753344900000,copilot,gpt-4o,950,210,email_draft +alice@acme-corp.com,1753430400000,copilot,gpt-4o,1800,420,meeting_summary +bob@acme-corp.com,1753344060000,copilot,gpt-4o-mini,600,120, +bob@acme-corp.com,1753347600000,copilot,gpt-4o,1350,310,email_draft +carol@acme-corp.com,1753430460000,copilot,gpt-4o-mini,450,95, +carol@acme-corp.com,1753516800000,copilot,gpt-4o,2100,530,meeting_summary diff --git a/docs/sources/slack/slack-analytics/README.md b/docs/sources/slack/slack-analytics/README.md index 53bca6e3e8..464710366b 100644 --- a/docs/sources/slack/slack-analytics/README.md +++ b/docs/sources/slack/slack-analytics/README.md @@ -48,7 +48,7 @@ All requests are proxied to `https://www.slack.com/api/…` with a **User OAuth For enabling Slack Analytics with the Psoxy you must first create an app on your Slack Enterprise Grid organization. -1. Go to https://api.slack.com/apps and create an app. +1. Go to [https://api.slack.com/apps](https://api.slack.com/apps) and create an app. 
- Select "From scratch", choose a name (for example "Worklytics connector") and a development workspace. ![](../slack-discovery-api/img/slack-step-1.png) diff --git a/docs/sources/slack/slack-discovery-api/README.md b/docs/sources/slack/slack-discovery-api/README.md index 79c528309a..3d9cffc676 100644 --- a/docs/sources/slack/slack-discovery-api/README.md +++ b/docs/sources/slack/slack-discovery-api/README.md @@ -30,7 +30,7 @@ of the [Psoxy repository](https://github.com/Worklytics/psoxy). For enabling Slack via Discovery API with the Psoxy you must first set up an app on your Slack Enterprise instance. -1. Go to https://api.slack.com/apps and create an app. +1. Go to [https://api.slack.com/apps](https://api.slack.com/apps) and create an app. - Select "From scratch", choose a name (for example "Worklytics connector") and a development workspace ![](./img/slack-step-1.png) diff --git a/docs/sources/slack/slack-discovery-bulk/discovery-bulk-auto.yaml b/docs/sources/slack/slack-discovery-bulk/discovery-bulk-auto.yaml index a4af54fd96..3aacac25a6 100644 --- a/docs/sources/slack/slack-discovery-bulk/discovery-bulk-auto.yaml +++ b/docs/sources/slack/slack-discovery-bulk/discovery-bulk-auto.yaml @@ -1,5 +1,5 @@ # Auto-detect format rules for Slack Discovery Bulk -# Works with NDJSON, JSON Array, and Parquet files +# Works with NDJSON, JSON Lines (.jsonl), JSON Array, and Parquet files # Use prefix patterns (e.g., users*, channels*) to match files regardless of extension fileRules: diff --git a/docs/sources/workdata-generic/workdata-generic.yaml b/docs/sources/workdata-generic/workdata-generic.yaml index 6689239cc8..b8984df6d0 100644 --- a/docs/sources/workdata-generic/workdata-generic.yaml +++ b/docs/sources/workdata-generic/workdata-generic.yaml @@ -51,4 +51,3 @@ fileRules: - pseudonymize: - "$.accountId" - "$.associatedIdentities[*].id" - diff --git a/infra/examples-dev/aws/google-workspace-variables.tf b/infra/examples-dev/aws/google-workspace-variables.tf index 
8ceca1f169..cf63f376b4 100644 --- a/infra/examples-dev/aws/google-workspace-variables.tf +++ b/infra/examples-dev/aws/google-workspace-variables.tf @@ -80,6 +80,6 @@ locals { variable "google_workspace_connector_settings" { type = map(any) - description = "Map of configuration settings specifically for Google Workspace connectors (e.g. example users). Note that provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." + description = "Map of configuration settings specifically for Google Workspace connectors. Supported keys: example_user, example_admin, provision_keys, key_rotation_days, provision_service_accounts, enable_apis. Provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." default = {} } diff --git a/infra/examples-dev/aws/google-workspace.tf b/infra/examples-dev/aws/google-workspace.tf index dd1b8ac794..9ad51be685 100644 --- a/infra/examples-dev/aws/google-workspace.tf +++ b/infra/examples-dev/aws/google-workspace.tf @@ -8,7 +8,7 @@ provider "google" { module "worklytics_connectors_google_workspace" { source = "../../modules/worklytics-connectors-google-workspace" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-google-workspace?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-google-workspace?ref=v0.6.7" google_workspace_connector_settings = var.google_workspace_connector_settings diff --git a/infra/examples-dev/aws/main.tf b/infra/examples-dev/aws/main.tf index 9534f5b9ab..b9deb268a3 100644 --- a/infra/examples-dev/aws/main.tf +++ b/infra/examples-dev/aws/main.tf @@ -21,7 +21,7 @@ terraform { # general cases module "worklytics_connectors" { source = "../../modules/worklytics-connectors" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors?ref=v0.6.6" + # source = 
"git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors?ref=v0.6.7" enabled_connectors = var.enabled_connectors connector_settings = var.connector_settings @@ -121,7 +121,7 @@ locals { module "psoxy" { source = "../../modules/aws-host" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-host?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-host?ref=v0.6.7" environment_name = var.environment_name aws_account_id = var.aws_account_id @@ -200,7 +200,7 @@ module "connection_in_worklytics" { for_each = local.all_instances source = "../../modules/worklytics-proxy-connection-aws" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-proxy-connection-aws?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-proxy-connection-aws?ref=v0.6.7" proxy_instance_id = each.key worklytics_host = var.worklytics_host diff --git a/infra/examples-dev/aws/msft-365.tf b/infra/examples-dev/aws/msft-365.tf index 6250de2c14..fc77977e39 100644 --- a/infra/examples-dev/aws/msft-365.tf +++ b/infra/examples-dev/aws/msft-365.tf @@ -2,7 +2,7 @@ module "worklytics_connectors_msft_365" { source = "../../modules/worklytics-connectors-msft-365" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-msft-365?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-msft-365?ref=v0.6.7" msft_365_connector_settings = var.msft_365_connector_settings @@ -52,7 +52,7 @@ module "cognito_identity_pool" { count = local.msft_365_enabled ? 
1 : 0 # only provision identity pool if MSFT-365 connectors are enabled source = "../../modules/aws-cognito-pool" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-cognito-pool?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-cognito-pool?ref=v0.6.7" developer_provider_name = local.developer_provider_name name = "${local.env_qualifier}-azure-ad-federation" @@ -75,7 +75,7 @@ module "cognito_identity" { count = local.msft_365_enabled ? 1 : 0 # only provision identity pool if MSFT-365 connectors are enabled source = "../../modules/aws-cognito-identity-cli" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-cognito-identity-cli?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-cognito-identity-cli?ref=v0.6.7" aws_region = data.aws_region.current.region @@ -113,7 +113,7 @@ module "msft_connection_auth_federation" { for_each = local.provision_entraid_apps ? local.enabled_to_entraid_object : local.shared_to_entraid_object source = "../../modules/azuread-federated-credentials" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/azuread-federated-credentials?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/azuread-federated-credentials?ref=v0.6.7" application_id = each.value.connector_id display_name = "${local.env_qualifier}AccessFromAWS" diff --git a/infra/examples-dev/gcp/google-workspace-variables.tf b/infra/examples-dev/gcp/google-workspace-variables.tf index 8ceca1f169..cf63f376b4 100644 --- a/infra/examples-dev/gcp/google-workspace-variables.tf +++ b/infra/examples-dev/gcp/google-workspace-variables.tf @@ -80,6 +80,6 @@ locals { variable "google_workspace_connector_settings" { type = map(any) - description = "Map of configuration settings specifically for Google Workspace connectors (e.g. example users). 
Note that provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." + description = "Map of configuration settings specifically for Google Workspace connectors. Supported keys: example_user, example_admin, provision_keys, key_rotation_days, provision_service_accounts, enable_apis. Provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." default = {} } diff --git a/infra/examples-dev/gcp/google-workspace.tf b/infra/examples-dev/gcp/google-workspace.tf index dd1b8ac794..9ad51be685 100644 --- a/infra/examples-dev/gcp/google-workspace.tf +++ b/infra/examples-dev/gcp/google-workspace.tf @@ -8,7 +8,7 @@ provider "google" { module "worklytics_connectors_google_workspace" { source = "../../modules/worklytics-connectors-google-workspace" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-google-workspace?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-google-workspace?ref=v0.6.7" google_workspace_connector_settings = var.google_workspace_connector_settings diff --git a/infra/examples-dev/gcp/main.tf b/infra/examples-dev/gcp/main.tf index 7015635a67..227feb0cb3 100644 --- a/infra/examples-dev/gcp/main.tf +++ b/infra/examples-dev/gcp/main.tf @@ -30,7 +30,7 @@ locals { # call this 'generic_source_connectors'? 
module "worklytics_connectors" { source = "../../modules/worklytics-connectors" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors?ref=v0.6.7" base_dir = var.psoxy_base_dir enabled_connectors = var.enabled_connectors @@ -104,7 +104,7 @@ locals { module "psoxy" { source = "../../modules/gcp-host" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/gcp-host?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/gcp-host?ref=v0.6.7" gcp_project_id = var.gcp_project_id environment_name = var.environment_name @@ -168,7 +168,7 @@ module "connection_in_worklytics" { for_each = local.all_instances source = "../../modules/worklytics-proxy-connection-generic" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-proxy-connection-generic?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-proxy-connection-generic?ref=v0.6.7" host_platform_id = local.host_platform_id proxy_instance_id = each.key diff --git a/infra/examples-dev/gcp/msft-365.tf b/infra/examples-dev/gcp/msft-365.tf index 305e21915d..407a631074 100644 --- a/infra/examples-dev/gcp/msft-365.tf +++ b/infra/examples-dev/gcp/msft-365.tf @@ -2,7 +2,7 @@ module "worklytics_connectors_msft_365" { source = "../../modules/worklytics-connectors-msft-365" - # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-msft-365?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/worklytics-connectors-msft-365?ref=v0.6.7" msft_365_connector_settings = var.msft_365_connector_settings @@ -37,7 +37,7 @@ module "msft-connection-auth-federation" { for_each = module.worklytics_connectors_msft_365.enabled_api_connectors source = "../../modules/azuread-federated-credentials" - # source = 
"git::https://github.com/worklytics/psoxy//infra/modules/azuread-federated-credentials?ref=v0.6.6" + # source = "git::https://github.com/worklytics/psoxy//infra/modules/azuread-federated-credentials?ref=v0.6.7" application_id = each.value.connector.id display_name = "GcpFederation" diff --git a/infra/modules/aws-host/main.tf b/infra/modules/aws-host/main.tf index a9470b59ca..3b3bfa31a3 100644 --- a/infra/modules/aws-host/main.tf +++ b/infra/modules/aws-host/main.tf @@ -110,6 +110,7 @@ module "psoxy" { enable_webhook_testing = local.enable_webhook_testing webhook_allow_origins = distinct(flatten([for v in var.webhook_collectors : v.allow_origins])) artifacts_bucket_name = var.artifacts_bucket_name + enable_remote_resources = var.enable_remote_resources allowed_data_access_ip_blocks = var.allowed_data_access_ip_blocks allowed_webhook_ip_blocks = var.allowed_webhook_ip_blocks } @@ -127,6 +128,7 @@ locals { connector_instance_resource_path = { for k, v in merge(var.api_connectors, var.bulk_connectors, var.webhook_collectors) : k => "${local.shared_resource_path}${replace(upper(k), "-", "_")}/" } + remote_resources_enabled = var.enable_remote_resources && module.psoxy.artifacts_bucket_name != null # convert custom_side_outputs to the format expected by the psoxy module custom_original_side_outputs = { for k, v in var.custom_side_outputs : @@ -261,9 +263,9 @@ module "api_connector" { var.general_environment_variables, ) - remote_resource_bucket = var.enable_remote_resources ? module.psoxy.artifacts_bucket_name : null - remote_resource_instance_path = var.enable_remote_resources ? local.connector_instance_resource_path[each.key] : null - remote_resource_shared_path = var.enable_remote_resources ? local.shared_resource_path : null + remote_resource_bucket = local.remote_resources_enabled ? module.psoxy.artifacts_bucket_name : null + remote_resource_instance_path = local.remote_resources_enabled ? 
local.connector_instance_resource_path[each.key] : null + remote_resource_shared_path = local.remote_resources_enabled ? local.shared_resource_path : null } @@ -349,9 +351,9 @@ module "bulk_connector" { var.general_environment_variables ) - remote_resource_bucket = var.enable_remote_resources ? module.psoxy.artifacts_bucket_name : null - remote_resource_instance_path = var.enable_remote_resources ? local.connector_instance_resource_path[each.key] : null - remote_resource_shared_path = var.enable_remote_resources ? local.shared_resource_path : null + remote_resource_bucket = local.remote_resources_enabled ? module.psoxy.artifacts_bucket_name : null + remote_resource_instance_path = local.remote_resources_enabled ? local.connector_instance_resource_path[each.key] : null + remote_resource_shared_path = local.remote_resources_enabled ? local.shared_resource_path : null } @@ -401,9 +403,9 @@ module "webhook_collectors" { var.general_environment_variables, ) - remote_resource_bucket = var.enable_remote_resources ? module.psoxy.artifacts_bucket_name : null - remote_resource_instance_path = var.enable_remote_resources ? local.connector_instance_resource_path[each.key] : null - remote_resource_shared_path = var.enable_remote_resources ? local.shared_resource_path : null + remote_resource_bucket = local.remote_resources_enabled ? module.psoxy.artifacts_bucket_name : null + remote_resource_instance_path = local.remote_resources_enabled ? local.connector_instance_resource_path[each.key] : null + remote_resource_shared_path = local.remote_resources_enabled ? 
local.shared_resource_path : null } # Policy to allow test caller to invoke webhook collector urls and sign webhook requests @@ -493,7 +495,6 @@ locals { caller_has_configured_output_buckets = ( length(var.bulk_connectors) > 0 || length(var.webhook_collectors) > 0 || - length(var.lookup_table_builders) > 0 || length([for k, v in var.api_connectors : k if try(v.enable_async_processing, false)]) > 0 || length([for k, v in local.sanitized_side_outputs : k if v != null]) > 0 ) @@ -503,7 +504,6 @@ locals { [for k, v in module.webhook_collectors : v.output_sanitized_bucket_id], [for k, v in module.api_connector : v.async_output_bucket_id if try(v.async_output_bucket_id, null) != null], [for k, v in module.api_connector : v.side_output_sanitized_bucket_id if try(v.side_output_sanitized_bucket_id, null) != null], - [for k, v in module.lookup_output : v.output_bucket], ))) caller_output_bucket_read_resources = flatten([ diff --git a/infra/modules/aws-host/variables.tf b/infra/modules/aws-host/variables.tf index d9c9e1d5bd..09763b3097 100644 --- a/infra/modules/aws-host/variables.tf +++ b/infra/modules/aws-host/variables.tf @@ -468,13 +468,12 @@ variable "todo_step" { variable "artifacts_bucket_name" { type = string - description = "Name of an existing S3 bucket to use for deployment artifacts. If null, one will be provisioned if needed." + description = "Name of an existing S3 bucket to use for deployment artifacts and remote resources (rules, NLP models, etc.). If null, one will be provisioned when needed for a local deployment bundle or when enable_remote_resources is true." default = null } - variable "enable_remote_resources" { type = bool - description = "**beta** Whether to enable remote resource loading from the artifacts S3 bucket (rules, NLP models, etc.). When true, sets REMOTE_RESOURCE_BUCKET env var and grants s3:GetObject to each Lambda. Default will change to `true` in next major version." 
- default = false # will change to true in 0.7.x + description = "**beta** Whether to enable remote resource loading from the artifacts S3 bucket (rules, NLP models, etc.). When true, sets REMOTE_RESOURCE_BUCKET env var and grants s3:GetObject to each Lambda. Provisions an artifacts bucket if one is not already created or provided." + default = false } diff --git a/infra/modules/aws-proxy-api/main.tf b/infra/modules/aws-proxy-api/main.tf index 2f4493041a..f6634493ef 100644 --- a/infra/modules/aws-proxy-api/main.tf +++ b/infra/modules/aws-proxy-api/main.tf @@ -341,14 +341,37 @@ locals { "${request.method} ${request.path}" => join("", [for name, value in try(request.headers, {}) : " -H \"${name}: ${value}\""]) } + # Shell script positional args; omit empty content-type/body/header slots unless needed + example_api_script_invocation = { for request in local.all_example_api_requests : + "${request.method} ${request.path}" => join(" ", concat( + [request.method, "'${replace(request.path, "'", "'\\''")}'"], + request.body != null ? [ + coalesce(request.content_type, "application/json"), + "'${replace(request.body, "\"", "\\\"")}'" + ] : [], + (request.body == null && trimspace(lookup(local.example_request_header_flags, "${request.method} ${request.path}", "")) != "") ? ["''", "''"] : [], + trimspace(lookup(local.example_request_header_flags, "${request.method} ${request.path}", "")) != "" ? 
[ + "'${replace(trimspace(lookup(local.example_request_header_flags, "${request.method} ${request.path}", "")), "'", "'\\''")}'" + ] : [] + )) + } + example_api_get_requests_for_script = [for r in local.example_api_get_requests : merge(r, { - header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + script_invocation = local.example_api_script_invocation["${r.method} ${r.path}"] })] example_api_post_requests_for_script = [for r in local.all_example_api_requests : merge(r, { - header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + script_invocation = local.example_api_script_invocation["${r.method} ${r.path}"] }) if r.method == "POST" && r.body != null] + example_api_script_invocations = concat( + [for r in local.example_api_get_requests_for_script : r.script_invocation], + [for r in local.example_api_post_requests_for_script : r.script_invocation] + ) + + test_script_filename = "test-${var.instance_id}.sh" default_header_flags = length(local.example_api_get_requests_for_script) > 0 ? local.example_api_get_requests_for_script[0].header_flags : "" # Generate test calls from all example requests @@ -406,6 +429,20 @@ ${join("\n", local.command_test_calls)} Feel free to try the above calls, and reference to the source's API docs for other parameters / endpoints to experiment with. +### Or use the test script + +We also generated `${local.test_script_filename}`, a wrapper script around the test tool. 
+Run it with no arguments to exercise the default example endpoint, or pass arguments to try +others: + +```shell +./${local.test_script_filename} +``` + +```shell +${join("\n", [for invocation in local.example_api_script_invocations : "./${local.test_script_filename} ${invocation}"])} +``` + As an alternative, we offer a simpler bash script for testing that wraps `awscurl` + `jq`, if those are installed on your system: diff --git a/infra/modules/aws-proxy-api/test_script.tftpl b/infra/modules/aws-proxy-api/test_script.tftpl index bc2b7e73c4..558baaff0d 100644 --- a/infra/modules/aws-proxy-api/test_script.tftpl +++ b/infra/modules/aws-proxy-api/test_script.tftpl @@ -27,10 +27,10 @@ ASYNC_CALL_RC=0 echo "Invoke this script with any of the following as arguments to test other endpoints:" %{ for example_api_get_request in example_api_get_requests ~} - printf "\t%s\n" "${example_api_get_request.method} '${example_api_get_request.path}' '' '' '${replace(example_api_get_request.header_flags, "'", "'\\''")}'" + printf "\t%s\n" "${example_api_get_request.script_invocation}" %{ endfor ~} %{ for example_api_post_request in example_api_post_requests ~} - printf "\t%s\n" "${example_api_post_request.method} '${example_api_post_request.path}' ${example_api_post_request.content_type} '${replace(example_api_post_request.body, "\"", "\\\"")}' '${replace(example_api_post_request.header_flags, "'", "'\\''")}'" + printf "\t%s\n" "${example_api_post_request.script_invocation}" %{ endfor ~} exit $(( HEALTHCHECK_RC + SYNC_CALL_RC + ASYNC_CALL_RC )) diff --git a/infra/modules/aws-proxy-bulk/provision_testing_infra.tftest.hcl b/infra/modules/aws-proxy-bulk/provision_testing_infra.tftest.hcl index 3d44417c80..0d8961c67e 100644 --- a/infra/modules/aws-proxy-bulk/provision_testing_infra.tftest.hcl +++ b/infra/modules/aws-proxy-bulk/provision_testing_infra.tftest.hcl @@ -1,14 +1,14 @@ # Plan succeeds with testing IAM/bucket policies enabled and disabled. 
variables { - environment_name = "test" - instance_id = "hris" - aws_account_id = "123456789012" - path_to_function_zip = "../aws-proxy-lambda/tests/deployment.zip" - function_zip_hash = "dummy-hash-for-test" - path_to_instance_ssm_parameters = "PSOXY_TEST_HRIS_" - provision_iam_policy_for_testing = false - aws_principal_arn_when_testing = null + environment_name = "test" + instance_id = "hris" + aws_account_id = "123456789012" + path_to_function_zip = "../aws-proxy-lambda/tests/deployment.zip" + function_zip_hash = "dummy-hash-for-test" + path_to_instance_ssm_parameters = "PSOXY_TEST_HRIS_" + provision_iam_policy_for_testing = false + aws_principal_arn_when_testing = null aws_write_role_to_assume_when_testing = null } diff --git a/infra/modules/aws-proxy-lambda/main.tf b/infra/modules/aws-proxy-lambda/main.tf index f13d52efb1..876f8d0e05 100644 --- a/infra/modules/aws-proxy-lambda/main.tf +++ b/infra/modules/aws-proxy-lambda/main.tf @@ -98,8 +98,8 @@ resource "aws_lambda_function" "instance" { length(var.path_to_shared_ssm_parameters) > 0 ? { PATH_TO_SHARED_CONFIG = var.path_to_shared_ssm_parameters } : {}, local.is_instance_ssm_prefix_default ? {} : { PATH_TO_INSTANCE_CONFIG = var.path_to_instance_ssm_parameters }, var.remote_resource_bucket != null ? { REMOTE_RESOURCE_BUCKET = var.remote_resource_bucket } : {}, - var.remote_resource_instance_path != null ? { INSTANCE_RESOURCE_PATH = var.remote_resource_instance_path } : {}, - var.remote_resource_shared_path != null ? { SHARED_RESOURCE_PATH = var.remote_resource_shared_path } : {}, + var.remote_resource_bucket != null && var.remote_resource_instance_path != null ? { INSTANCE_RESOURCE_PATH = var.remote_resource_instance_path } : {}, + var.remote_resource_bucket != null && var.remote_resource_shared_path != null ? { SHARED_RESOURCE_PATH = var.remote_resource_shared_path } : {}, ) } @@ -301,14 +301,14 @@ locals { remote_resource_instance_prefix = var.remote_resource_instance_path != null ? 
trimsuffix(var.remote_resource_instance_path, "/") : "" remote_resource_shared_prefix = var.remote_resource_shared_path != null ? trimsuffix(var.remote_resource_shared_path, "/") : "" - remote_resource_s3_object_arns = distinct(compact([ + remote_resource_s3_object_arns = var.remote_resource_bucket != null ? distinct(compact([ var.remote_resource_instance_path != null ? ( local.remote_resource_instance_prefix != "" ? "arn:aws:s3:::${var.remote_resource_bucket}/${local.remote_resource_instance_prefix}/*" : "arn:aws:s3:::${var.remote_resource_bucket}/*" ) : "", var.remote_resource_shared_path != null ? ( local.remote_resource_shared_prefix != "" ? "arn:aws:s3:::${var.remote_resource_bucket}/${local.remote_resource_shared_prefix}/*" : "arn:aws:s3:::${var.remote_resource_bucket}/*" ) : "", - ])) + ])) : [] remote_resource_bucket_statements = var.remote_resource_bucket != null ? [{ Sid = "ReadRemoteResourceBucket" diff --git a/infra/modules/aws-proxy-lambda/remote_resource_iam.tftest.hcl b/infra/modules/aws-proxy-lambda/remote_resource_iam.tftest.hcl index 57a4df55c2..7372d651f9 100644 --- a/infra/modules/aws-proxy-lambda/remote_resource_iam.tftest.hcl +++ b/infra/modules/aws-proxy-lambda/remote_resource_iam.tftest.hcl @@ -93,3 +93,18 @@ run "duplicate_prefixes_dedupe_to_one_arn" { condition = toset(one([for s in jsondecode(aws_iam_policy.required_resource_access.policy).Statement : s.Resource if s.Sid == "ReadRemoteResourceBucket"])) == toset(["arn:aws:s3:::my-artifacts/shared/*"]) } } + +run "paths_without_bucket_skip_iam" { + command = plan + + variables { + remote_resource_bucket = null + remote_resource_instance_path = "instances/foo/" + remote_resource_shared_path = "shared/models/" + } + + assert { + error_message = "paths without a bucket should not create ReadRemoteResourceBucket IAM statement" + condition = length([for s in jsondecode(aws_iam_policy.required_resource_access.policy).Statement : s if s.Sid == "ReadRemoteResourceBucket"]) == 0 + } +} diff --git 
a/infra/modules/aws-webhook-collector/variables.tf b/infra/modules/aws-webhook-collector/variables.tf index bd3a0cd073..6998bb28b6 100644 --- a/infra/modules/aws-webhook-collector/variables.tf +++ b/infra/modules/aws-webhook-collector/variables.tf @@ -201,11 +201,9 @@ variable "provision_auth_key" { validation { condition = ( - var.provision_auth_key == null || - ( - try(var.provision_auth_key.rotation_days, null) == null || - try(var.provision_auth_key.rotation_days, 0) > 0 - ) + var.provision_auth_key == null ? true : + var.provision_auth_key.rotation_days == null ? true : + var.provision_auth_key.rotation_days > 0 ) error_message = "If `provision_auth_key` is provided, `rotation_days` must be a positive number or null." } diff --git a/infra/modules/aws/main.tf b/infra/modules/aws/main.tf index 216bb367b2..36b8c4a096 100644 --- a/infra/modules/aws/main.tf +++ b/infra/modules/aws/main.tf @@ -158,8 +158,8 @@ module "psoxy_package" { locals { # determine if the JAR is local and should be uploaded directly from plan-time variables is_local_jar = var.deployment_bundle == null || !startswith(coalesce(var.deployment_bundle, "unknown"), "s3://") - should_provision_bucket = local.is_local_jar && var.artifacts_bucket_name == null - target_artifacts_bucket = var.artifacts_bucket_name != null ? var.artifacts_bucket_name : (local.should_provision_bucket ? 
aws_s3_bucket.artifacts[0].bucket : null) + should_provision_bucket = var.artifacts_bucket_name == null && (local.is_local_jar || var.enable_remote_resources) + target_artifacts_bucket = coalesce(var.artifacts_bucket_name, try(aws_s3_bucket.artifacts[0].bucket, null)) should_upload_object = local.is_local_jar && (var.artifacts_bucket_name != null || local.should_provision_bucket) } diff --git a/infra/modules/aws/variables.tf b/infra/modules/aws/variables.tf index 5ea5e4c533..830ac85967 100644 --- a/infra/modules/aws/variables.tf +++ b/infra/modules/aws/variables.tf @@ -148,10 +148,16 @@ variable "enable_webhook_testing" { variable "artifacts_bucket_name" { type = string - description = "Name of an existing S3 bucket to use for deployment artifacts. If null, one will be provisioned if needed." + description = "Name of an existing S3 bucket to use for deployment artifacts and remote resources (rules, NLP models, etc.). If null, one will be provisioned when needed for a local deployment bundle or when enable_remote_resources is true." default = null } +variable "enable_remote_resources" { + type = bool + description = "Whether to provision an artifacts bucket for remote resources when one is not otherwise needed for deployment (e.g. with an s3:// deployment_bundle)." + default = false +} + variable "allowed_data_access_ip_blocks" { description = <<-EOT IPs or CIDR blocks allowed to make data access requests. When non-empty, adds infrastructure-level aws:SourceIp conditions on api-caller role assume-role policies (see docs/configuration/ip-allowlisting.md). Application-layer enforcement is configured separately on proxy Lambdas via the host module. 
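Reviewer note on the `infra/modules/aws/main.tf` change above: the three locals (`is_local_jar`, `should_provision_bucket`, `should_upload_object`) interact in a way that is easier to check as a truth table. The sketch below is a minimal Python model of those expressions as they appear in this diff. It is not part of the module, and the function name `plan_artifacts_bucket` and the example bundle URL are hypothetical, introduced only for illustration.

```python
from typing import Optional


def plan_artifacts_bucket(
    deployment_bundle: Optional[str],
    artifacts_bucket_name: Optional[str],
    enable_remote_resources: bool,
) -> dict:
    """Hypothetical helper mirroring the locals in infra/modules/aws/main.tf."""
    # A bundle counts as "local" unless it is an s3:// URL (null means build locally).
    is_local_jar = deployment_bundle is None or not deployment_bundle.startswith("s3://")

    # Provision a bucket only when none was supplied AND something needs one:
    # either a local JAR to upload, or remote resources being enabled.
    should_provision_bucket = artifacts_bucket_name is None and (
        is_local_jar or enable_remote_resources
    )

    # Upload only applies to local JARs, into a supplied or a provisioned bucket.
    should_upload_object = is_local_jar and (
        artifacts_bucket_name is not None or should_provision_bucket
    )

    return {
        "is_local_jar": is_local_jar,
        "should_provision_bucket": should_provision_bucket,
        "should_upload_object": should_upload_object,
    }


# Example inputs (the bundle URL is made up for illustration).
for bundle, bucket, remote in [
    (None, None, False),
    ("s3://example-bucket/psoxy.jar", None, False),
    ("s3://example-bucket/psoxy.jar", None, True),
    (None, "existing-bucket", False),
]:
    print(bundle, bucket, remote, plan_artifacts_bucket(bundle, bucket, remote))
```

The case this change appears to target is the third one: a prebuilt `s3://` bundle with no bucket name and `enable_remote_resources = true` now provisions a bucket without uploading a JAR to it.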
diff --git a/infra/modules/gcp-host/bulk_service_config.tftest.hcl b/infra/modules/gcp-host/bulk_service_config.tftest.hcl new file mode 100644 index 0000000000..301d5409bf --- /dev/null +++ b/infra/modules/gcp-host/bulk_service_config.tftest.hcl @@ -0,0 +1,123 @@ + +# Test bulk connector memory/CPU defaults in gcp-host module + +variables { + gcp_project_id = "test-project-123456" + environment_name = "test" + worklytics_sa_emails = ["test@example.com"] + psoxy_base_dir = "../../../" + + bulk_connectors = { + "default-bulk" = { + source_kind = "hris" + rules = { + columnsToPseudonymize = ["employee_id"] + } + } + "high-memory" = { + source_kind = "hris" + available_memory_mb = 4096 + rules = { + columnsToPseudonymize = ["employee_id"] + } + } + "custom-memory" = { + source_kind = "hris" + available_memory_mb = 2048 + rules = { + columnsToPseudonymize = ["employee_id"] + } + } + } + + custom_bulk_connector_arguments = { + "args-override" = { + available_memory_mb = 512 + } + } + + api_connectors = {} + webhook_collectors = {} +} + +mock_provider "google" { + mock_data "google_project" { + defaults = { + project_id = "test-project-123456" + number = 123456789 + } + } + + mock_data "google_compute_default_service_account" { + defaults = { + email = "123456789-compute@developer.gserviceaccount.com" + name = "projects/test-project-123456/serviceAccounts/123456789-compute@developer.gserviceaccount.com" + } + } +} + +run "setup" { + command = plan + + variables { + bulk_connectors = merge(var.bulk_connectors, { + "args-override" = { + source_kind = "hris" + rules = { + columnsToPseudonymize = ["employee_id"] + } + } + }) + } +} + +run "default_memory_and_cpu" { + command = plan + + assert { + error_message = "Default bulk memory should be 1024M" + condition = run.setup.bulk_connector["default-bulk"].function_config.service_config[0].available_memory == "1024M" + } + + assert { + error_message = "Default bulk CPU should be 0.5 for 1024M memory" + condition = 
run.setup.bulk_connector["default-bulk"].function_config.service_config[0].available_cpu == "0.5" + } +} + +run "auto_cpu_for_high_memory" { + command = plan + + assert { + error_message = "4096M memory should auto-select 1 CPU" + condition = run.setup.bulk_connector["high-memory"].function_config.service_config[0].available_cpu == "1" + } +} + +run "custom_memory_with_auto_cpu" { + command = plan + + assert { + error_message = "Configured memory should be applied" + condition = run.setup.bulk_connector["custom-memory"].function_config.service_config[0].available_memory == "2048M" + } + + assert { + error_message = "2048M memory should auto-select 1 CPU" + condition = run.setup.bulk_connector["custom-memory"].function_config.service_config[0].available_cpu == "1" + } +} + +run "custom_bulk_connector_arguments_override" { + command = plan + + assert { + error_message = "custom_bulk_connector_arguments should override connector memory" + condition = run.setup.bulk_connector["args-override"].function_config.service_config[0].available_memory == "512M" + } + + assert { + error_message = "512M memory should auto-select 0.333 CPU" + condition = run.setup.bulk_connector["args-override"].function_config.service_config[0].available_cpu == "0.333" + } +} diff --git a/infra/modules/gcp-host/main.tf b/infra/modules/gcp-host/main.tf index 701d1e37b2..f80d44bc52 100644 --- a/infra/modules/gcp-host/main.tf +++ b/infra/modules/gcp-host/main.tf @@ -26,6 +26,7 @@ locals { connector_instance_resource_path = { for k, v in merge(var.api_connectors, var.bulk_connectors, var.webhook_collectors) : k => "${local.shared_resource_path}${replace(upper(k), "-", "_")}/" } + remote_resources_enabled = var.enable_remote_resources && module.psoxy.artifacts_bucket_name != null # rules_file paths may be absolute, relative to the Terraform root module (deployment dir), or # relative to psoxy_base_dir (paths into the psoxy repo, eg docs/sources/...) 
@@ -57,6 +58,14 @@ locals { if try(v.rules_file, null) != null } + bulk_connector_available_memory_mb = { + for k, v in var.bulk_connectors : k => coalesce( + try(var.custom_bulk_connector_arguments[k].available_memory_mb, null), + try(v.available_memory_mb, null), + 1024, + ) + } + webhook_collector_rules_file_paths = { for k, v in var.webhook_collectors : k => local._resolved_rules_file_paths[v.rules_file] if try(v.rules_file, null) != null @@ -99,6 +108,7 @@ module "psoxy" { provision_testing_infra = var.provision_testing_infra gcp_principals_authorized_to_test = var.gcp_principals_authorized_to_test custom_artifacts_bucket_name = var.custom_artifacts_bucket_name + enable_remote_resources = var.enable_remote_resources support_bulk_mode = length(var.bulk_connectors) > 0 support_webhook_collectors = length(var.webhook_collectors) > 0 vpc_config = var.vpc_config @@ -237,7 +247,7 @@ module "api_connector" { environment_id_prefix = local.environment_id_prefix instance_id = each.key service_account_email = google_service_account.api_connectors[each.key].email - artifacts_bucket_name = module.psoxy.artifacts_bucket_name + artifacts_bucket_name = module.psoxy.deployment_bundle_bucket deployment_bundle_object_name = module.psoxy.deployment_bundle_object_name artifact_repository_id = module.psoxy.artifact_repository vpc_config = module.psoxy.vpc_config @@ -282,9 +292,9 @@ module "api_connector" { var.general_environment_variables, ) - remote_resource_bucket = var.enable_remote_resources ? module.psoxy.artifacts_bucket_name : null - remote_resource_instance_path = var.enable_remote_resources ? local.connector_instance_resource_path[each.key] : null - remote_resource_shared_path = var.enable_remote_resources ? local.shared_resource_path : null + remote_resource_bucket = local.remote_resources_enabled ? module.psoxy.artifacts_bucket_name : null + remote_resource_instance_path = local.remote_resources_enabled ? 
local.connector_instance_resource_path[each.key] : null + remote_resource_shared_path = local.remote_resources_enabled ? local.shared_resource_path : null secret_bindings = merge( local.secrets_bound_as_env_vars[each.key], @@ -341,7 +351,7 @@ module "webhook_collector" { service_account_id = google_service_account.webhook_collector[each.key].id email = google_service_account.webhook_collector[each.key].email } - artifacts_bucket_name = module.psoxy.artifacts_bucket_name + artifacts_bucket_name = module.psoxy.deployment_bundle_bucket deployment_bundle_object_name = module.psoxy.deployment_bundle_object_name artifact_repository_id = module.psoxy.artifact_repository path_to_repo_root = var.psoxy_base_dir @@ -378,9 +388,9 @@ module "webhook_collector" { var.general_environment_variables, ) - remote_resource_bucket = var.enable_remote_resources ? module.psoxy.artifacts_bucket_name : null - remote_resource_instance_path = var.enable_remote_resources ? local.connector_instance_resource_path[each.key] : null - remote_resource_shared_path = var.enable_remote_resources ? local.shared_resource_path : null + remote_resource_bucket = local.remote_resources_enabled ? module.psoxy.artifacts_bucket_name : null + remote_resource_instance_path = local.remote_resources_enabled ? local.connector_instance_resource_path[each.key] : null + remote_resource_shared_path = local.remote_resources_enabled ? 
local.shared_resource_path : null secret_bindings = module.psoxy.secrets @@ -404,7 +414,7 @@ module "bulk_connector" { worklytics_sa_emails = var.worklytics_sa_emails config_parameter_prefix = local.config_parameter_prefix source_kind = each.value.source_kind - artifacts_bucket_name = module.psoxy.artifacts_bucket_name + artifacts_bucket_name = module.psoxy.deployment_bundle_bucket artifact_repository_id = module.psoxy.artifact_repository deployment_bundle_object_name = module.psoxy.deployment_bundle_object_name psoxy_base_dir = var.psoxy_base_dir @@ -420,7 +430,7 @@ module "bulk_connector" { sanitized_bucket_name = try(each.value.sanitized_bucket_name, null) todos_as_local_files = var.todos_as_local_files tf_runner_iam_principal = module.tf_runner.iam_principal - available_memory_mb = coalesce(try(var.custom_bulk_connector_arguments[each.key].available_memory_mb, null), try(each.value.available_memory_mb, null), 512) + available_memory_mb = local.bulk_connector_available_memory_mb[each.key] timeout_seconds = coalesce(try(var.custom_bulk_connector_arguments[each.key].timeout_seconds, null), try(each.value.timeout_seconds, null), 540) gcp_principals_authorized_to_test = var.gcp_principals_authorized_to_test bucket_force_destroy = var.bucket_force_destroy @@ -446,9 +456,9 @@ module "bulk_connector" { var.general_environment_variables, ) - remote_resource_bucket = var.enable_remote_resources ? module.psoxy.artifacts_bucket_name : null - remote_resource_instance_path = var.enable_remote_resources ? local.connector_instance_resource_path[each.key] : null - remote_resource_shared_path = var.enable_remote_resources ? local.shared_resource_path : null + remote_resource_bucket = local.remote_resources_enabled ? module.psoxy.artifacts_bucket_name : null + remote_resource_instance_path = local.remote_resources_enabled ? local.connector_instance_resource_path[each.key] : null + remote_resource_shared_path = local.remote_resources_enabled ? 
local.shared_resource_path : null depends_on = [ module.psoxy # some of the set-up IAM grants done there, but not EXPLICITLY passed out as outputs and into above as inputs, are required; so make this explicit diff --git a/infra/modules/gcp-host/variables.tf b/infra/modules/gcp-host/variables.tf index a5a65cff0c..7fdb6e3d67 100644 --- a/infra/modules/gcp-host/variables.tf +++ b/infra/modules/gcp-host/variables.tf @@ -181,7 +181,7 @@ variable "kms_key_ring" { variable "custom_artifacts_bucket_name" { type = string - description = "name of bucket to use for custom artifacts, if you want something other than default" + description = "Name of an existing GCS bucket to use for deployment artifacts and remote resources (rules, NLP models, etc.). If null, one will be provisioned when needed for a local deployment bundle or when enable_remote_resources is true." default = null } @@ -463,6 +463,6 @@ variable "max_instances_per_api_connector" { variable "enable_remote_resources" { type = bool - description = "**beta** Whether to enable remote resource loading from the artifacts GCS bucket (rules, NLP models, etc.). When true, sets REMOTE_RESOURCE_BUCKET env var and grants roles/storage.objectViewer to each Cloud Function. Default will change to `true` in next major version." - default = false # will change to true in 0.7.x + description = "**beta** Whether to enable remote resource loading from the artifacts GCS bucket (rules, NLP models, etc.). When true, sets REMOTE_RESOURCE_BUCKET env var and grants roles/storage.objectViewer to each Cloud Function. Provisions an artifacts bucket if one is not already created or provided." 
+ default = false } diff --git a/infra/modules/gcp-proxy-api/main.tf b/infra/modules/gcp-proxy-api/main.tf index 941982c9a3..37e3c1be4d 100644 --- a/infra/modules/gcp-proxy-api/main.tf +++ b/infra/modules/gcp-proxy-api/main.tf @@ -291,7 +291,7 @@ resource "google_cloudfunctions2_function" "function" { iterator = secret_environment_variable content { - key = secret_environment_variable.key + key = secret_environment_variable.key # project_id string (not number) avoids apply-time drift: number comes from data.google_project project_id = var.gcp_project.project_id secret = secret_environment_variable.value.secret_id @@ -400,14 +400,37 @@ locals { "${request.method} ${request.path}" => join("", [for name, value in try(request.headers, {}) : " -H \"${name}: ${value}\""]) } + # Shell script positional args; omit empty content-type/body/header slots unless needed + example_api_script_invocation = { for request in local.all_example_api_requests : + "${request.method} ${request.path}" => join(" ", concat( + [request.method, "'${replace(request.path, "'", "'\\''")}'"], + request.body != null ? [ + coalesce(request.content_type, "application/json"), + "'${replace(request.body, "\"", "\\\"")}'" + ] : [], + (request.body == null && trimspace(lookup(local.example_request_header_flags, "${request.method} ${request.path}", "")) != "") ? ["''", "''"] : [], + trimspace(lookup(local.example_request_header_flags, "${request.method} ${request.path}", "")) != "" ? 
[ + "'${replace(trimspace(lookup(local.example_request_header_flags, "${request.method} ${request.path}", "")), "'", "'\\''")}'" + ] : [] + )) + } + example_api_get_requests_for_script = [for r in local.example_api_get_requests : merge(r, { - header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + script_invocation = local.example_api_script_invocation["${r.method} ${r.path}"] })] example_api_post_requests_for_script = [for r in local.all_example_api_requests : merge(r, { - header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + header_flags = trimspace(lookup(local.example_request_header_flags, "${r.method} ${r.path}", "")) + script_invocation = local.example_api_script_invocation["${r.method} ${r.path}"] }) if r.method == "POST" && r.body != null] + example_api_script_invocations = concat( + [for r in local.example_api_get_requests_for_script : r.script_invocation], + [for r in local.example_api_post_requests_for_script : r.script_invocation] + ) + + test_script_filename = "test-${trimprefix(var.instance_id, var.environment_id_prefix)}.sh" default_header_flags = length(local.example_api_get_requests_for_script) > 0 ? local.example_api_get_requests_for_script[0].header_flags : "" # Generate test calls from all example requests @@ -450,6 +473,20 @@ Feel free to try the above calls, and reference to the source's API docs for oth endpoints to experiment with. If you spot any additional fields you believe should be redacted/pseudonymized, feel free to modify [customize the rules](${var.path_to_repo_root}docs/gcp/custom-rules.md). +### Or use the test script + +We also generated `${local.test_script_filename}`, a wrapper script around the test tool. 
+Run it with no arguments to exercise the default example endpoint, or pass arguments to try +others: + +```shell +./${local.test_script_filename} +``` + +```shell +${join("\n", [for invocation in local.example_api_script_invocations : "./${local.test_script_filename} ${invocation}"])} +``` + ### Check logs (GCP runtime logs) Based on your configuration, the following command allows you to inspect the logs of your Psoxy diff --git a/infra/modules/gcp-proxy-api/test_script.tftpl b/infra/modules/gcp-proxy-api/test_script.tftpl index bc2b7e73c4..558baaff0d 100644 --- a/infra/modules/gcp-proxy-api/test_script.tftpl +++ b/infra/modules/gcp-proxy-api/test_script.tftpl @@ -27,10 +27,10 @@ ASYNC_CALL_RC=0 echo "Invoke this script with any of the following as arguments to test other endpoints:" %{ for example_api_get_request in example_api_get_requests ~} - printf "\t%s\n" "${example_api_get_request.method} '${example_api_get_request.path}' '' '' '${replace(example_api_get_request.header_flags, "'", "'\\''")}'" + printf "\t%s\n" "${example_api_get_request.script_invocation}" %{ endfor ~} %{ for example_api_post_request in example_api_post_requests ~} - printf "\t%s\n" "${example_api_post_request.method} '${example_api_post_request.path}' ${example_api_post_request.content_type} '${replace(example_api_post_request.body, "\"", "\\\"")}' '${replace(example_api_post_request.header_flags, "'", "'\\''")}'" + printf "\t%s\n" "${example_api_post_request.script_invocation}" %{ endfor ~} exit $(( HEALTHCHECK_RC + SYNC_CALL_RC + ASYNC_CALL_RC )) diff --git a/infra/modules/gcp-proxy-bulk/main.tf b/infra/modules/gcp-proxy-bulk/main.tf index 469f21288d..871630a6db 100644 --- a/infra/modules/gcp-proxy-bulk/main.tf +++ b/infra/modules/gcp-proxy-bulk/main.tf @@ -5,6 +5,22 @@ locals { CLOUD_FUNCTION_NAME_MAX_LENGTH = 63 } +# GCP Cloud Run / Cloud Functions gen2 minimum CPU for a memory allocation. 
+# https://cloud.google.com/run/docs/configuring/services/cpu#cpu-memory +locals { + resolved_memory_mb = coalesce(var.available_memory_mb, 1024) + + auto_available_cpu = ( + local.resolved_memory_mb <= 512 ? "0.333" : + local.resolved_memory_mb <= 1024 ? "0.5" : + local.resolved_memory_mb <= 4096 ? "1" : + local.resolved_memory_mb <= 8192 ? "2" : + local.resolved_memory_mb <= 16384 ? "4" : + local.resolved_memory_mb <= 24576 ? "6" : + "8" + ) +} + # computed locals { # legacy pre 0.5 may not pass instance_id @@ -225,7 +241,8 @@ resource "google_cloudfunctions2_function" "function" { # NOTE: bulk connectors are STILL triggered through HTTPS invocations, so these have default HTTPS endpoint URls service_config { - available_memory = "${coalesce(var.available_memory_mb, 1024)}M" + available_memory = "${local.resolved_memory_mb}M" + available_cpu = local.auto_available_cpu service_account_email = google_service_account.service_account.email timeout_seconds = var.timeout_seconds ingress_settings = "ALLOW_INTERNAL_ONLY" @@ -251,7 +268,7 @@ resource "google_cloudfunctions2_function" "function" { iterator = secret_environment_variable content { - key = secret_environment_variable.key + key = secret_environment_variable.key # project_id string (not number) avoids apply-time drift: number comes from data.google_project project_id = var.gcp_project.project_id secret = secret_environment_variable.value.secret_id diff --git a/infra/modules/gcp-proxy-bulk/variables.tf b/infra/modules/gcp-proxy-bulk/variables.tf index e9ccec2ccb..e21fd11710 100644 --- a/infra/modules/gcp-proxy-bulk/variables.tf +++ b/infra/modules/gcp-proxy-bulk/variables.tf @@ -97,6 +97,8 @@ variable "bucket_write_role_id" { variable "available_memory_mb" { type = number + # TODO: future version - replace with available_memory (string), passed through to + # google_cloudfunctions2_function.service_config.available_memory (e.g. "1024M"). description = "Memory (in MB), available to the function. 
Default value is 1024. Possible values include 128, 256, 512, 1024, 2048, 4096; above that requires multiple CPUs, beyond scope of our built-in configurations." default = 1024 } diff --git a/infra/modules/gcp-webhook-collector/ip_lock_conditions.tftest.hcl b/infra/modules/gcp-webhook-collector/ip_lock_conditions.tftest.hcl index bbc3ad1fe4..d7193c665f 100644 --- a/infra/modules/gcp-webhook-collector/ip_lock_conditions.tftest.hcl +++ b/infra/modules/gcp-webhook-collector/ip_lock_conditions.tftest.hcl @@ -23,7 +23,7 @@ variables { provision_auth_key = { rotation_days = 30 } - example_identity = "test-user@example.com" + example_identity = "test-user@example.com" allowed_webhook_ip_blocks = ["10.0.0.0/16"] } diff --git a/infra/modules/gcp-webhook-collector/main.tf b/infra/modules/gcp-webhook-collector/main.tf index 0e31555d23..fbf1c3a8a1 100644 --- a/infra/modules/gcp-webhook-collector/main.tf +++ b/infra/modules/gcp-webhook-collector/main.tf @@ -367,7 +367,7 @@ resource "google_cloudfunctions2_function" "function" { iterator = secret_environment_variable content { - key = secret_environment_variable.key + key = secret_environment_variable.key # project_id string (not number) avoids apply-time drift: number comes from data.google_project project_id = var.gcp_project.project_id secret = secret_environment_variable.value.secret_id diff --git a/infra/modules/gcp/main.tf b/infra/modules/gcp/main.tf index b5b2085059..e141d4bad1 100644 --- a/infra/modules/gcp/main.tf +++ b/infra/modules/gcp/main.tf @@ -201,9 +201,10 @@ module "psoxy_package" { locals { # NOTE: `try` needed here bc Terraform doesn't short-circuit boolean evaluation - is_remote_bundle = var.deployment_bundle != null && try(startswith(var.deployment_bundle, "gs://"), false) - remote_bucket_name = local.is_remote_bundle ? split("/", var.deployment_bundle)[2] : null - remote_bundle_artifact = local.is_remote_bundle ? 
split("/", var.deployment_bundle)[3] : null + is_remote_bundle = var.deployment_bundle != null && try(startswith(var.deployment_bundle, "gs://"), false) + remote_bucket_name = local.is_remote_bundle ? split("/", var.deployment_bundle)[2] : null + remote_bundle_artifact = local.is_remote_bundle ? join("/", slice(split("/", var.deployment_bundle), 3, length(split("/", var.deployment_bundle)))) : null + should_provision_artifacts_bucket = var.custom_artifacts_bucket_name == null && (!local.is_remote_bundle || var.enable_remote_resources) file_name_with_sha1 = local.is_remote_bundle ? sha1(var.deployment_bundle) : replace(module.psoxy_package.filename, ".jar", "_${filesha1(module.psoxy_package.path_to_deployment_jar)}.zip") @@ -229,7 +230,7 @@ data "archive_file" "source" { # trivy:ignore:AVD-GCP-0078 # trivy:ignore:AVD-GCP-0077 resource "google_storage_bucket" "artifacts" { - count = local.is_remote_bundle ? 0 : 1 + count = local.should_provision_artifacts_bucket ? 1 : 0 project = var.project_id name = coalesce(var.custom_artifacts_bucket_name, "${var.project_id}-${var.environment_id_prefix}artifacts-bucket") @@ -259,7 +260,10 @@ resource "google_storage_bucket_object" "function" { } locals { - artifact_bucket_name = local.is_remote_bundle ? local.remote_bucket_name : google_storage_bucket.artifacts[0].name + # NOTE: not coalesce; Terraform evaluates all coalesce() args even when an earlier one is non-null, + # and coalesce fails when every argument is null (e.g. prebuilt gs:// bundle without remote resources). + artifacts_bucket_name = var.custom_artifacts_bucket_name != null ? var.custom_artifacts_bucket_name : try(google_storage_bucket.artifacts[0].name, null) + deployment_bundle_bucket = local.is_remote_bundle ? local.remote_bucket_name : local.artifacts_bucket_name deployment_bundle_object_name = local.is_remote_bundle ? 
local.remote_bundle_artifact : google_storage_bucket_object.function[0].name } @@ -535,7 +539,12 @@ output "gcp_project" { } output "artifacts_bucket_name" { - value = local.artifact_bucket_name + value = local.artifacts_bucket_name +} + +output "deployment_bundle_bucket" { + value = local.deployment_bundle_bucket + description = "GCS bucket containing the Cloud Function deployment bundle (may differ from artifacts_bucket_name when using a gs:// deployment_bundle)" } output "artifacts_bucket_id" { diff --git a/infra/modules/gcp/variables.tf b/infra/modules/gcp/variables.tf index e742eb2e76..df1f73f565 100644 --- a/infra/modules/gcp/variables.tf +++ b/infra/modules/gcp/variables.tf @@ -96,10 +96,16 @@ variable "gcp_principals_authorized_to_test" { variable "custom_artifacts_bucket_name" { type = string - description = "name of bucket to use for custom artifacts, if you want something other than default" + description = "Name of an existing GCS bucket to use for deployment artifacts and remote resources (rules, NLP models, etc.). If null, one will be provisioned when needed for a local deployment bundle or when enable_remote_resources is true." default = null } +variable "enable_remote_resources" { + type = bool + description = "Whether to provision an artifacts bucket for remote resources when one is not otherwise needed for deployment (e.g. with a gs:// deployment_bundle)." + default = false +} + variable "support_bulk_mode" { type = bool diff --git a/infra/modules/google-workspace-dwd-connection/main.tf b/infra/modules/google-workspace-dwd-connection/main.tf index d82ea9364d..00c75ebfae 100644 --- a/infra/modules/google-workspace-dwd-connection/main.tf +++ b/infra/modules/google-workspace-dwd-connection/main.tf @@ -19,11 +19,21 @@ locals { # TODO: md5 here is 32 chars of hex, so some risk of collision by truncating sa_account_id = length(local.padded_id) < 31 ? 
local.padded_id : substr(md5(local.padded_id), 0, 30) - instance_id = coalesce(var.instance_id, var.display_name) + instance_id = coalesce(var.instance_id, var.display_name) + expected_sa_email = "${local.sa_account_id}@${var.project_id}.iam.gserviceaccount.com" + oauth_client_id = var.provision_service_account ? google_service_account.connector_sa[0].unique_id : "REPLACE_WITH_NUMERIC_CLIENT_ID_AFTER_CREATING_SERVICE_ACCOUNT" + service_account_email_for_todo = var.provision_service_account ? google_service_account.connector_sa[0].email : local.expected_sa_email } # service account to personify connector +moved { + from = google_service_account.connector_sa + to = google_service_account.connector_sa[0] +} + resource "google_service_account" "connector_sa" { + count = var.provision_service_account ? 1 : 0 + project = var.project_id account_id = local.sa_account_id display_name = var.display_name @@ -31,7 +41,7 @@ resource "google_service_account" "connector_sa" { } resource "google_project_service" "apis_needed" { - for_each = toset(var.apis_consumed) + for_each = var.enable_apis ? toset(var.apis_consumed) : toset([]) service = each.key project = var.project_id @@ -69,14 +79,14 @@ locals { If you have already created a sufficiently privileged service account user for a different Google Workspace connection, you can re-use that one. - 6. Assign the account a sufficiently privileged role. At minimum, the role must have permission - to READ the following [Administrator Setting Privileges](https://support.google.com/a/answer/1219251): - * Admin API - * Domain Settings - * Groups - * Organizational Units - * Reports (required only if you are connecting to the Audit Logs, used for Google Chat, Meet, etc) - * Users + 6. Assign the account a sufficiently privileged role. 
At minimum, the role must grant read-only + access to the following [Administrator privileges](https://knowledge.workspace.google.com/admin/users/administrator-privilege-definitions) + (expand each category in the Custom Role editor and enable only the Read sub-action): + * Users → Read (required) + * Groups → Read (required) + * Organizational Units → Read (optional; for org-unit segmentation) + * Domain Management (optional; for list of internal domains) + * Reports (required only if connecting to Google Chat, Google Meet, or other audit-log connectors) You may use a predefined role, or define a [Custom Role](https://support.google.com/a/answer/2406043?fl=1). (NOTE: Steps 5/6 are optional, but highly recommended. You could use the account of a sufficiently @@ -88,33 +98,40 @@ connection to will fail) Account to Use for Connection' setting when they create the connection. 8. Optionally, you may also set the email address of the account you created the value of - `google_workspace_example_user` in your `terraform.tfvars` file. This will cause the example + `google_workspace_connector_settings` (eg, `example_user`) in your `terraform.tfvars` file. This will cause the example API invocations generated by the terraform modules to prefill this value as the account to impersonate on those requests. This will allow you to validate the permissions of the account, as well as the ability of the proxy connection to impersonate it. EOT + manual_sa_todo_note = var.provision_service_account ? "" : <<-EOT + + NOTE: Terraform did not provision this service account. After you create it manually, replace + the placeholder client ID above with the numeric ID shown in the GCP console for + `${local.expected_sa_email}`. +EOT + todo_content = < "Access and Data Control" --> "API Controls", then find "Manage Domain Wide Delegation". Click "Add new". - 2. Copy and paste client ID `${google_service_account.connector_sa.unique_id}` into the + 2. 
Copy and paste client ID `${local.oauth_client_id}` into the "Client ID" input in the popup. (this is the unique ID of the GCP service account with - email `${google_service_account.connector_sa.email}`; you can (and should) verify its identity - via the GCP console, with the project `${google_service_account.connector_sa.project}`, under: + email `${local.service_account_email_for_todo}`; you can (and should) verify its identity + via the GCP console, with the project `${var.project_id}`, under: - ["IAM & Admin" --> "Service Accounts"](https://console.cloud.google.com/iam-admin/serviceaccounts?project=${google_service_account.connector_sa.project}&supportedpurview=project) + ["IAM & Admin" --> "Service Accounts"](https://console.cloud.google.com/iam-admin/serviceaccounts?project=${var.project_id}&supportedpurview=project) This ensures you are granting domain-wide delegation to the correct service account, and mitigates the risk that these instructions were forged by a malicious actor. - +${local.manual_sa_todo_note} Via the GCP console, you can also verify all extant keys for the service account, to ensure that there is exactly one, which should be held by the proxy. GCP provides log of key usage, creation, revocation, etc, which you can monitor to ensure that the key is being used only by the proxy, only for the data access you expect. If you ever suspect compromise, you may revoke - the key from the GCP console at any time (NOTE: that proxy connection will be broken until your - Terraform configuration is re-applied, to provision a new key). + the key from the GCP console at any time (NOTE: that proxy connection will be broken until a new + key is provisioned and stored in your secrets manager). 3. Copy and paste the following OAuth 2.0 scope string into the "Scopes" input: ``` @@ -122,7 +139,7 @@ ${join(",", var.oauth_scopes_needed)} ``` 4. Authorize it. 
With this, your psoxy instance should be able to authenticate with Google as - the GCP Service Account `${google_service_account.connector_sa.email}` and request data from + the GCP Service Account `${local.service_account_email_for_todo}` and request data from Google as authorized by the OAuth scopes you granted. ${local.google_workspace_admin_account_required ? local.google_workspace_service_account_setup : ""} EOT diff --git a/infra/modules/google-workspace-dwd-connection/output.tf b/infra/modules/google-workspace-dwd-connection/output.tf index b362346589..2eaec8c4cd 100644 --- a/infra/modules/google-workspace-dwd-connection/output.tf +++ b/infra/modules/google-workspace-dwd-connection/output.tf @@ -3,15 +3,16 @@ output "instance_id" { } output "service_account_id" { - value = google_service_account.connector_sa.id + value = var.provision_service_account ? google_service_account.connector_sa[0].id : "projects/${var.project_id}/serviceAccounts/${local.expected_sa_email}" } output "service_account_email" { - value = google_service_account.connector_sa.email + value = var.provision_service_account ? google_service_account.connector_sa[0].email : local.expected_sa_email } output "service_account_numeric_id" { - value = google_service_account.connector_sa.unique_id + value = var.provision_service_account ? 
google_service_account.connector_sa[0].unique_id : null + description = "OAuth client ID for domain-wide delegation; null if the service account is not provisioned by Terraform" } output "next_todo_step" { diff --git a/infra/modules/google-workspace-dwd-connection/variables.tf b/infra/modules/google-workspace-dwd-connection/variables.tf index f527babc03..73acc2addf 100644 --- a/infra/modules/google-workspace-dwd-connection/variables.tf +++ b/infra/modules/google-workspace-dwd-connection/variables.tf @@ -36,6 +36,18 @@ variable "oauth_scopes_needed" { default = [] } +variable "provision_service_account" { + type = bool + description = "whether to provision the GCP service account (OAuth client) via Terraform. If false, you must create it manually." + default = true +} + +variable "enable_apis" { + type = bool + description = "whether to enable required GCP APIs via Terraform. If false, you must enable them manually." + default = true +} + variable "todos_as_local_files" { type = bool description = "whether to render TODOs as flat files" @@ -47,4 +59,3 @@ variable "todo_step" { description = "of all todos, where does this one logically fall in sequence" default = 1 } - diff --git a/infra/modules/worklytics-connector-specs/google-workspace.tf b/infra/modules/worklytics-connector-specs/google-workspace.tf index ff13d640d3..61091d5752 100644 --- a/infra/modules/worklytics-connector-specs/google-workspace.tf +++ b/infra/modules/worklytics-connector-specs/google-workspace.tf @@ -1,13 +1,15 @@ locals { google_workspace_example_user = try( - coalesce(var.google_workspace_connector_settings["google_workspace_example_user"]), + coalesce(var.google_workspace_connector_settings["example_user"]), coalesce(var.google_workspace_example_user, "REPLACE_WITH_EXAMPLE_USER@YOUR_COMPANY.COM") ) google_workspace_example_admin = try( - coalesce(var.google_workspace_connector_settings["google_workspace_example_admin"]), + coalesce(var.google_workspace_connector_settings["example_admin"]), 
coalesce(var.google_workspace_example_admin, local.google_workspace_example_user, "REPLACE_WITH_EXAMPLE_ADMIN@YOUR_COMPANY.COM") ) + # oauth_scopes_needed below are documented (short form, without the + # https://www.googleapis.com/auth/ prefix) in docs/sources/google-workspace/. google_workspace_sources = { "gcal" : { source_kind : "gcal", diff --git a/infra/modules/worklytics-connector-specs/main.tf b/infra/modules/worklytics-connector-specs/main.tf index 33e69a1071..9fbe17887d 100644 --- a/infra/modules/worklytics-connector-specs/main.tf +++ b/infra/modules/worklytics-connector-specs/main.tf @@ -109,7 +109,7 @@ EOT } chatgpt-enterprise = { source_kind : "chatgpt-enterprise", - availability : "alpha", + availability : "beta", enable_by_default : false, worklytics_connector_id : "chatgpt-enterprise-psoxy" display_name : "ChatGPT Enterprise" diff --git a/infra/modules/worklytics-connector-specs/msft-365.tf b/infra/modules/worklytics-connector-specs/msft-365.tf index e827f52188..1328ef11b9 100644 --- a/infra/modules/worklytics-connector-specs/msft-365.tf +++ b/infra/modules/worklytics-connector-specs/msft-365.tf @@ -67,6 +67,7 @@ locals { enable_side_output : false example_api_calls : [ "/v1.0/users", + "/v1.0/users?\\$select=id,mail,otherMails", "/v1.0/users/${local.example_msft_user_guid}/events", "/v1.0/users/${local.example_msft_user_guid}/calendarView?startDateTime=${timeadd(var.example_api_calls_sample_date, "-4320h")}&endDateTime=${var.example_api_calls_sample_date}", "/v1.0/users/${local.example_msft_user_guid}/mailboxSettings", @@ -94,6 +95,7 @@ locals { enable_side_output : false example_api_calls : [ "/v1.0/users", + "/v1.0/users?\\$select=id,mail,otherMails", "/v1.0/users/${local.example_msft_user_guid}/mailboxSettings", "/v1.0/users/${local.example_msft_user_guid}/mailFolders/SentItems/messages", "/v1.0/groups", diff --git a/infra/modules/worklytics-connector-specs/variables.tf b/infra/modules/worklytics-connector-specs/variables.tf index 
10d56bf490..b662de03ec 100644 --- a/infra/modules/worklytics-connector-specs/variables.tf +++ b/infra/modules/worklytics-connector-specs/variables.tf @@ -237,6 +237,6 @@ variable "msft_365_connector_settings" { variable "google_workspace_connector_settings" { type = map(any) - description = "Map of configuration settings specifically for Google Workspace connectors (e.g. example users). Note that provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." + description = "Map of configuration settings specifically for Google Workspace connectors. Supported keys: example_user, example_admin, provision_keys, key_rotation_days, provision_service_accounts, enable_apis. Provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." default = {} } diff --git a/infra/modules/worklytics-connectors-google-workspace/gcp-api-enable-todo.tftpl b/infra/modules/worklytics-connectors-google-workspace/gcp-api-enable-todo.tftpl new file mode 100644 index 0000000000..76bf2cb238 --- /dev/null +++ b/infra/modules/worklytics-connectors-google-workspace/gcp-api-enable-todo.tftpl @@ -0,0 +1,16 @@ +In the GCP console for `${gcp_project_id}` (or via `gcloud`), enable the following APIs required by the `${connector_id}` connector: + +%{ for api in apis_consumed ~} +- `${api}` +%{ endfor ~} + +Via the GCP console: navigate to "APIs & Services" --> "Library", search for each API above, and click "Enable". 
+ +Via gcloud (one command per API): +%{ for api in apis_consumed ~} +`gcloud services enable ${api} --project=${gcp_project_id}` +%{ endfor ~} + +See the page below for more information on provisioning Google Workspace connectors without Terraform: + +https://docs.worklytics.co/psoxy/sources/google-workspace#provisioning-api-clients-without-terraform diff --git a/infra/modules/worklytics-connectors-google-workspace/gcp-sa-create-todo.tftpl b/infra/modules/worklytics-connectors-google-workspace/gcp-sa-create-todo.tftpl new file mode 100644 index 0000000000..461e53cfcc --- /dev/null +++ b/infra/modules/worklytics-connectors-google-workspace/gcp-sa-create-todo.tftpl @@ -0,0 +1,18 @@ +In the GCP console for `${gcp_project_id}` (or via `gcloud`), create a service account to use as the OAuth client for the `${connector_id}` connector: + +- **Account ID**: `${service_account_id}` +- **Display name**: `${display_name}` +- **Description**: `${description}` + +The service account email should be `${expected_service_account_email}`. + +Via the GCP console: navigate to "IAM & Admin" --> "Service Accounts" --> "Create Service Account", and use the values above. + +Via gcloud: +`gcloud iam service-accounts create ${service_account_id} --display-name="${display_name}" --description="${description}" --project=${gcp_project_id}` + +After creating the service account, note its numeric **Client ID** (unique ID) from the service account details page. You will need it when completing domain-wide delegation setup. 
+ +See the page below for more information on provisioning Google Workspace connectors without Terraform: + +https://docs.worklytics.co/psoxy/sources/google-workspace#provisioning-api-clients-without-terraform diff --git a/infra/modules/worklytics-connectors-google-workspace/main.tf b/infra/modules/worklytics-connectors-google-workspace/main.tf index 95a41da4f1..0919676abc 100644 --- a/infra/modules/worklytics-connectors-google-workspace/main.tf +++ b/infra/modules/worklytics-connectors-google-workspace/main.tf @@ -1,6 +1,18 @@ locals { - provision_gcp_sa_keys = try(var.google_workspace_connector_settings["google_workspace_provision_keys"], var.provision_gcp_sa_keys) - gcp_sa_key_rotation_days = try(var.google_workspace_connector_settings["google_workspace_key_rotation_days"], var.gcp_sa_key_rotation_days) + provision_service_accounts = try(var.google_workspace_connector_settings["provision_service_accounts"], true) + enable_apis = try(var.google_workspace_connector_settings["enable_apis"], true) + provision_gcp_sa_keys = ( + local.provision_service_accounts + ? try(var.google_workspace_connector_settings["provision_keys"], var.provision_gcp_sa_keys) + : false + ) + gcp_sa_key_rotation_days = try(var.google_workspace_connector_settings["key_rotation_days"], var.gcp_sa_key_rotation_days) + + manual_steps_before_dwd = (local.enable_apis ? 0 : 1) + (local.provision_service_accounts ? 0 : 1) + dwd_todo_step = var.todo_step + local.manual_steps_before_dwd + api_todo_step = var.todo_step + sa_todo_step = var.todo_step + (local.enable_apis ? 
0 : 1) + key_todo_step = local.dwd_todo_step + 1 } terraform { required_version = "~> 1.7" @@ -46,39 +58,97 @@ module "google_workspace_connection" { description = "Google API OAuth Client for ${each.value.display_name}" apis_consumed = each.value.apis_consumed oauth_scopes_needed = each.value.oauth_scopes_needed + provision_service_account = local.provision_service_accounts + enable_apis = local.enable_apis todos_as_local_files = var.todos_as_local_files - todo_step = var.todo_step + todo_step = local.dwd_todo_step } locals { + api_enable_todos = { + for id, connection in module.google_workspace_connection : + id => templatefile("${path.module}/gcp-api-enable-todo.tftpl", { + gcp_project_id : var.gcp_project_id + connector_id : id + apis_consumed : module.worklytics_connector_specs.enabled_google_workspace_connectors[id].apis_consumed + }) + } + + sa_creation_todos = { + for id, connection in module.google_workspace_connection : + id => templatefile("${path.module}/gcp-sa-create-todo.tftpl", { + gcp_project_id : var.gcp_project_id + connector_id : id + service_account_id : "${local.environment_id_prefix}${substr(id, 0, 30 - length(local.environment_id_prefix))}" + display_name : "Psoxy Connector - ${local.environment_id_display_name_qualifier}${module.worklytics_connector_specs.enabled_google_workspace_connectors[id].display_name}" + description : "Google API OAuth Client for ${module.worklytics_connector_specs.enabled_google_workspace_connectors[id].display_name}" + expected_service_account_email : connection.service_account_email + }) + } + key_creation_todos = { for id, connection in module.google_workspace_connection : id => templatefile("${path.module}/gcp-sa-key-create-todo.tftpl", { gcp_project_id : var.gcp_project_id, gcp_service_account : connection.service_account_email, secret_prefix : connection.instance_id }) } - todos = [for id, connection in module.google_workspace_connection : - local.provision_gcp_sa_keys ? 
connection.todo : "${local.key_creation_todos[id]}\n${connection.todo}" - ] + connector_todos = { + for id, connection in module.google_workspace_connection : + id => join("\n\n", [for part in [ + local.enable_apis ? null : local.api_enable_todos[id], + local.provision_service_accounts ? null : local.sa_creation_todos[id], + connection.todo, + local.provision_gcp_sa_keys ? null : local.key_creation_todos[id], + ] : part if part != null]) + } + + todos = [for id, connection in module.google_workspace_connection : local.connector_todos[id]] - current_todo_step = try(max(values(module.google_workspace_connection)[*].next_todo_step...), var.todo_step) + current_todo_step = try(max(values(module.google_workspace_connection)[*].next_todo_step...), local.dwd_todo_step) next_todo_step = local.provision_gcp_sa_keys ? local.current_todo_step : local.current_todo_step + 1 + connectors_needing_manual_api_enablement = { + for k, v in module.worklytics_connector_specs.enabled_google_workspace_connectors : + k => v + if !local.enable_apis + } + + connectors_needing_manual_sa_creation = { + for k, v in module.worklytics_connector_specs.enabled_google_workspace_connectors : + k => v + if !local.provision_service_accounts + } + service_accounts_tf_managed_keys = local.provision_gcp_sa_keys ? { for k, v in module.worklytics_connector_specs.enabled_google_workspace_connectors : k => module.google_workspace_connection[k].service_account_id } : {} - service_accounts_user_managed_keys = local.provision_gcp_sa_keys ? {} : { + service_accounts_user_managed_keys = { for k, v in module.worklytics_connector_specs.enabled_google_workspace_connectors : k => module.google_workspace_connection[k].service_account_id + if !local.provision_gcp_sa_keys } } +resource "local_file" "todo_gcp_api_enablement" { + for_each = var.todos_as_local_files ? 
local.connectors_needing_manual_api_enablement : {} + + filename = "TODO ${local.api_todo_step} - Enable APIs for ${each.key}.md" + content = local.api_enable_todos[each.key] +} + +resource "local_file" "todo_gcp_sa_creation" { + for_each = var.todos_as_local_files ? local.connectors_needing_manual_sa_creation : {} + + filename = "TODO ${local.sa_todo_step} - Create Service Account for ${each.key}.md" + content = local.sa_creation_todos[each.key] +} + resource "local_file" "todo_gcp_sa_key_creation" { for_each = var.todos_as_local_files ? local.service_accounts_user_managed_keys : {} - filename = "TODO ${local.current_todo_step} - Create Key for ${each.key}.md" + filename = "TODO ${local.key_todo_step} - Create Key for ${each.key}.md" content = local.key_creation_todos[each.key] } diff --git a/infra/modules/worklytics-connectors-google-workspace/variables.tf b/infra/modules/worklytics-connectors-google-workspace/variables.tf index dd039be8c4..2d46c69e39 100644 --- a/infra/modules/worklytics-connectors-google-workspace/variables.tf +++ b/infra/modules/worklytics-connectors-google-workspace/variables.tf @@ -50,7 +50,7 @@ variable "google_workspace_example_admin" { variable "provision_gcp_sa_keys" { type = bool - description = "whether to provision key for each connector's GCP Service Account (OAuth Client). If false, you must create the key manually and provide it." + description = "[DEPRECATED - use google_workspace_connector_settings map instead] whether to provision key for each connector's GCP Service Account (OAuth Client). If false, you must create the key manually and provide it. Ignored if service accounts are not provisioned by Terraform." default = true } @@ -80,6 +80,6 @@ variable "todo_step" { variable "google_workspace_connector_settings" { type = map(any) - description = "Map of configuration settings specifically for Google Workspace connectors (e.g. example users). 
Note that provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." + description = "Map of configuration settings specifically for Google Workspace connectors. Supported keys: example_user, example_admin, provision_keys, key_rotation_days, provision_service_accounts, enable_apis. Provider-controlling parameters (like GCP project IDs or impersonation SAs) remain top-level variables." default = {} } diff --git a/java/core/src/main/java/co/worklytics/psoxy/ProcessedDataMetadataFields.java b/java/core/src/main/java/co/worklytics/psoxy/ProcessedDataMetadataFields.java index 4af44fa3d3..4b60acbc55 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/ProcessedDataMetadataFields.java +++ b/java/core/src/main/java/co/worklytics/psoxy/ProcessedDataMetadataFields.java @@ -1,12 +1,17 @@ package co.worklytics.psoxy; +import java.util.Arrays; +import java.util.Optional; + import lombok.NonNull; import lombok.RequiredArgsConstructor; /** - * metadata fields that Psoxy may add to processed data responses - * -- as HTTP headers on http responses - * -- as metadata if written to objects + * metadata fields that Psoxy may add to processed data responses. + * + *

On sync HTTP responses, only these fields are exposed as headers (via {@link #getHttpHeader()}). + * Request-capture metadata ({@link co.worklytics.psoxy.gateway.output.ApiDataOutputUtils.OutputObjectMetadata}) + * is stored on {@link co.worklytics.psoxy.gateway.ProcessedContent} for async/side outputs only. */ @RequiredArgsConstructor public enum ProcessedDataMetadataFields { @@ -51,4 +56,10 @@ public String getMetadataKey() { return formattedName; } + public static Optional fromMetadataKey(String metadataKey) { + return Arrays.stream(values()) + .filter(f -> f.getMetadataKey().equals(metadataKey)) + .findFirst(); + } + } diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/BulkContentTypes.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/BulkContentTypes.java new file mode 100644 index 0000000000..2fd6e23a1e --- /dev/null +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/BulkContentTypes.java @@ -0,0 +1,76 @@ +package co.worklytics.psoxy.gateway; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; +import java.util.Set; +import org.apache.hc.core5.http.ContentType; + +/** + * MIME types for bulk file processing and webhook batch merging. + * + *

Reuses {@link ContentType} from Apache HttpCore where available. Bulk formats that HttpCore + * does not define (CSV, Parquet, JSON Lines variants) are registered via {@link ContentType#create}. + */ +public final class BulkContentTypes { + + private BulkContentTypes() {} + + // HttpCore-defined + public static final ContentType JSON = ContentType.APPLICATION_JSON; + public static final ContentType NDJSON = ContentType.APPLICATION_NDJSON; + public static final ContentType FORM_URLENCODED = ContentType.APPLICATION_FORM_URLENCODED; + + // Not defined by HttpCore ContentType + public static final ContentType NDJSON_ALT = ContentType.create("application/ndjson"); + public static final ContentType JSONL = ContentType.create("application/jsonl"); + public static final ContentType JSONLINES = ContentType.create("application/jsonlines"); + public static final ContentType JSONLINES_ALT = ContentType.create("application/x-jsonlines"); + public static final ContentType CSV = ContentType.create("text/csv"); + public static final ContentType APPLICATION_CSV = ContentType.create("application/csv"); + public static final ContentType PARQUET = ContentType.create("application/vnd.apache.parquet"); + + /** Inferred CSV content type for object storage metadata (includes charset). */ + public static final String CSV_UTF8 = CSV.getMimeType() + "; charset=utf-8"; + + /** + * Content-Types that cloud consoles often attach to bulk uploads, but that do not reflect the + * file format. + */ + public static final Set KNOWN_GENERIC_UPLOAD_TYPES = Set.of( + FORM_URLENCODED.getMimeType() + ); + + /** + * Supported bulk Content-Type base values (parameters such as {@code charset} matched separately). 
+ */ + public static final Set SUPPORTED_BULK_BASES = Set.of( + CSV.getMimeType(), + APPLICATION_CSV.getMimeType(), + JSON.getMimeType(), + NDJSON.getMimeType(), + NDJSON_ALT.getMimeType(), + JSONL.getMimeType(), + JSONLINES.getMimeType(), + JSONLINES_ALT.getMimeType(), + PARQUET.getMimeType() + ); + + /** + * Input content types that can be concatenated into newline-delimited JSON output. + */ + public static final Set MERGEABLE_JSON_RECORD_TYPES = Set.of( + JSON.getMimeType(), + NDJSON.getMimeType(), + NDJSON_ALT.getMimeType(), + JSONL.getMimeType(), + JSONLINES.getMimeType(), + JSONLINES_ALT.getMimeType() + ); + + public static String describeContentTypes(Set types) { + List sorted = new ArrayList<>(types); + Collections.sort(sorted); + return String.join(", ", sorted); + } +} diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/ProcessedContent.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/ProcessedContent.java index 02d32b33f7..8c2c4ba348 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/ProcessedContent.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/ProcessedContent.java @@ -50,19 +50,27 @@ public class ProcessedContent implements Serializable { Map metadata = new HashMap<>(); /** - * the actual content + * the actual content; may be null when the upstream response has no body (e.g. 204, HEAD) */ + @Getter(lombok.AccessLevel.NONE) byte[] content; + /** + * Returns content bytes for consumers that need to read or write the body. + * Missing bodies are treated as an empty array to avoid NPEs in output writers. + */ + public byte[] getContent() { + return content != null ? 
content : new byte[0]; + } + /** * for convenience, a method to get the content as a string - rather than byte array - * @return the content as a string, using the specified contentCharset + * @return the content as a string, using the specified contentCharset; null if no body */ public String getContentAsString() { - if (getContent() == null) { + if (content == null) { return null; - } else { - return new String(getContent(), contentCharset); } + return new String(content, contentCharset); } } diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/ProxyConstants.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/ProxyConstants.java index 257b144963..293d20b625 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/ProxyConstants.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/ProxyConstants.java @@ -22,7 +22,7 @@ public class ProxyConstants { /** * Version of the Java source code. Used to identify the version of the proxy. */ - public static final String JAVA_SOURCE_CODE_VERSION = "v0.6.6"; + public static final String JAVA_SOURCE_CODE_VERSION = "v0.6.7"; /** * a random UUID used to salt the hash of the salt. Purpose of this is to invalidate any non-purpose built rainbow table solution. diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/TransientConfigException.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/TransientConfigException.java new file mode 100644 index 0000000000..a6bc3eab5d --- /dev/null +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/TransientConfigException.java @@ -0,0 +1,19 @@ +package co.worklytics.psoxy.gateway; + +/** + * Signals that a config/secret backend had a transient failure (credential rotation, network + * blip, service hiccup) and the value may still be accessible on the next attempt. 
+ * + * Distinct from a missing value ({@code Optional.empty()} / {@code NEGATIVE_VALUE}): callers + * should NOT treat this as "property not configured" — they should retry or serve a cached value. + */ +public class TransientConfigException extends RuntimeException { + + public TransientConfigException(String message, Throwable cause) { + super(message, cause); + } + + public TransientConfigException(String message) { + super(message); + } +} diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandler.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandler.java index c1b638d294..b3242505ca 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandler.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandler.java @@ -405,6 +405,14 @@ public HttpEventResponse handle(HttpEventRequest requestToProxy, log.log(Level.WARNING, "Confirm oauth scopes set in config.yaml match those granted in data source"); return builder.build(); + } catch (co.worklytics.psoxy.gateway.TransientConfigException e) { + // Config store was temporarily unreachable (e.g. credential rotation, AWS hiccup). + // The proxy already retried internally; this is not a misconfiguration. 
+ builder.statusCode(HttpStatus.SC_SERVICE_UNAVAILABLE); + builder.header(ProcessedDataMetadataFields.ERROR.getHttpHeader(), + ErrorCauses.CONFIGURATION_FAILURE.name()); + log.log(Level.WARNING, "Transient config store failure after retries: " + e.getMessage(), e); + return builder.build(); } catch (java.util.NoSuchElementException e) { // missing config, such as ACCESS_TOKEN builder.statusCode(HttpStatus.SC_INTERNAL_SERVER_ERROR); @@ -550,7 +558,11 @@ && isSafeMethod(requestToSourceApi.getRequestMethod())) { ProcessedContent original = apiDataOutputUtils .responseAsRawProcessedContent(requestToSourceApi, sourceApiResponse); try { - apiDataSideOutput.writeRaw(original, processingContext); + apiDataSideOutput.writeRaw( + original.toBuilder() + .metadata(apiDataOutputUtils.buildSourceApiRequestMetadata(requestToSourceApi)) + .build(), + processingContext); } catch (Output.WriteFailure e) { log.log(Level.WARNING, "Error writing to side output for original content", e); builder.multivaluedHeader( @@ -572,8 +584,8 @@ && isSafeMethod(requestToSourceApi.getRequestMethod())) { processingContext); } else { proxyResponseContent = sanitizationResult.getContentAsString(); - sanitizationResult.getMetadata().entrySet() - .forEach(e -> builder.header(e.getKey(), e.getValue())); + sanitizedApiResponseMetadata(sanitizationResult.getMetadata()) + .forEach((header, value) -> builder.header(header, value)); } @@ -616,17 +628,18 @@ && isSafeMethod(requestToSourceApi.getRequestMethod())) { } } - ProcessedContent sanitize(HttpEventRequest request, RequestUrls requestUrls, + ProcessedContent sanitize(HttpEventRequest requestToProxy, RequestUrls requestUrls, ProcessedContent originalContent) { - RESTApiSanitizer sanitizerForRequest = getSanitizerForRequest(request); + RESTApiSanitizer sanitizerForRequest = getSanitizerForRequest(requestToProxy); String sanitized = - StringUtils.trimToEmpty(sanitizerForRequest.sanitize(request.getHttpMethod(), + 
StringUtils.trimToEmpty(sanitizerForRequest.sanitize(requestToProxy.getHttpMethod(), requestUrls.getOriginal(), originalContent.getContentAsString())); String rulesSha = rulesUtils.sha(sanitizerForRequest.getRules()); log.info("response sanitized with rule set " + rulesSha); - Map metadata = new HashMap<>(originalContent.getMetadata()); + Map metadata = + new HashMap<>(apiDataOutputUtils.buildProxyRequestMetadata(requestToProxy)); metadata.put(ProcessedDataMetadataFields.RULES_SHA.getMetadataKey(), rulesSha); metadata.put(ProcessedDataMetadataFields.PROXY_VERSION.getMetadataKey(), ProxyConstants.JAVA_SOURCE_CODE_VERSION); @@ -756,6 +769,19 @@ static Set normalizeHeaders(Set headers) { .collect(Collectors.toUnmodifiableSet()); } + /** + * Filters processed-content metadata to fields intended for sync HTTP responses. + * Request-capture metadata remains on {@link ProcessedContent} for async/side outputs. + */ + @VisibleForTesting + static Map sanitizedApiResponseMetadata(Map metadata) { + return metadata.entrySet().stream() + .flatMap(e -> ProcessedDataMetadataFields.fromMetadataKey(e.getKey()) + .stream() + .map(f -> Map.entry(f.getHttpHeader(), e.getValue()))) + .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)); + } + @SneakyThrows HttpRequestFactory getRequestFactory(HttpEventRequest request) { diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/BatchMergeHandler.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/BatchMergeHandler.java index dc9acc72d3..7348d292cd 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/BatchMergeHandler.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/BatchMergeHandler.java @@ -4,13 +4,12 @@ import java.io.IOException; import java.io.UncheckedIOException; import java.util.Objects; -import java.util.Set; import java.util.concurrent.atomic.AtomicInteger; import java.util.logging.Level; import java.util.stream.Stream; import 
java.util.zip.GZIPOutputStream; import javax.inject.Inject; -import org.apache.hc.core5.http.ContentType; +import co.worklytics.psoxy.gateway.BulkContentTypes; import co.worklytics.psoxy.gateway.ProcessedContent; import co.worklytics.psoxy.gateway.impl.output.OutputUtils; import co.worklytics.psoxy.gateway.output.Output; @@ -32,14 +31,6 @@ @Log public class BatchMergeHandler { - // supported input content types - // eg, stuff that can be concatenated into ndjson output - // which is json, or other ndjson - public static final Set SUPPORTED_INPUT_CONTENT_TYPES = Set.of( - ContentType.APPLICATION_JSON.getMimeType(), - ContentType.APPLICATION_NDJSON.getMimeType() - ); - public static final String GZIP_CONTENT_ENCODING = "gzip"; // output @@ -75,8 +66,11 @@ public void handleBatch(Stream batch) { if (item.getContentType() == null) { throw new IllegalArgumentException("Batch items must have a content type"); } - if (!SUPPORTED_INPUT_CONTENT_TYPES.contains(item.getContentType())) { - throw new IllegalArgumentException("Batch items must have content type 'application/json' or 'application/x-ndjson'; was " + item.getContentType()); + if (!BulkContentTypes.MERGEABLE_JSON_RECORD_TYPES.contains(item.getContentType())) { + throw new IllegalArgumentException( + "Batch items must have one of the supported content types: " + + BulkContentTypes.describeContentTypes(BulkContentTypes.MERGEABLE_JSON_RECORD_TYPES) + + "; was " + item.getContentType()); } byte[] uncompressedContent; if (GZIP_CONTENT_ENCODING.equals(item.getContentEncoding())) { @@ -110,7 +104,7 @@ public void handleBatch(Stream batch) { ProcessedContent combined = ProcessedContent.builder() .contentEncoding(GZIP_CONTENT_ENCODING) .content(byteArrayOutputStream.toByteArray()) - .contentType(ContentType.APPLICATION_NDJSON.getMimeType()) // suggested, but not yet an official standard IANA type + .contentType(BulkContentTypes.NDJSON.getMimeType()) // suggested, but not yet an official standard IANA type .build(); 
outputUtils.forBatchedWebhookContent().write(combined); diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecorator.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecorator.java index a37139193a..961a9c84be 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecorator.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecorator.java @@ -8,22 +8,53 @@ import java.util.concurrent.ExecutionException; import java.util.concurrent.TimeUnit; import java.util.stream.Collectors; + import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Ticker; import com.google.common.cache.CacheBuilder; import com.google.common.cache.CacheLoader; import com.google.common.cache.LoadingCache; +import com.google.common.util.concurrent.Futures; +import com.google.common.util.concurrent.ListenableFuture; +import com.google.common.util.concurrent.UncheckedExecutionException; import co.worklytics.psoxy.gateway.ConfigService; import co.worklytics.psoxy.gateway.SecretStore; +import co.worklytics.psoxy.gateway.TransientConfigException; import co.worklytics.psoxy.gateway.WritableConfigService; -import lombok.RequiredArgsConstructor; +import lombok.NonNull; import lombok.SneakyThrows; +import lombok.extern.java.Log; + +import java.util.logging.Level; -@RequiredArgsConstructor +@Log public class CachingConfigServiceDecorator implements WritableConfigService, SecretStore { + static final int MAX_TRANSIENT_RETRIES = 3; + static final long DEFAULT_TRANSIENT_RETRY_DELAY_MS = 500L; + final ConfigService delegate; final Duration defaultTtl; + final Ticker ticker; + final long transientRetryDelayMs; + + public CachingConfigServiceDecorator(ConfigService delegate, Duration defaultTtl) { + this(delegate, defaultTtl, Ticker.systemTicker(), DEFAULT_TRANSIENT_RETRY_DELAY_MS); + } + + @VisibleForTesting + 
CachingConfigServiceDecorator(ConfigService delegate, Duration defaultTtl, Ticker ticker) { + this(delegate, defaultTtl, ticker, DEFAULT_TRANSIENT_RETRY_DELAY_MS); + } + + @VisibleForTesting + CachingConfigServiceDecorator(ConfigService delegate, Duration defaultTtl, Ticker ticker, long transientRetryDelayMs) { + this.delegate = delegate; + this.defaultTtl = defaultTtl; + this.ticker = ticker; + this.transientRetryDelayMs = transientRetryDelayMs; + } private volatile LoadingCache cache; @@ -39,12 +70,55 @@ LoadingCache getCache() { if (this.cache == null) { this.cache = CacheBuilder.newBuilder() .maximumSize(100) - .expireAfterWrite(defaultTtl.getSeconds(), TimeUnit.SECONDS) + .ticker(ticker) + .refreshAfterWrite(defaultTtl.getSeconds(), TimeUnit.SECONDS) .recordStats() .build(new CacheLoader() { //req for java8-backwards compatibility @Override - public String load(ConfigProperty key) { - return delegate.getConfigPropertyAsOptional(key).orElse(NEGATIVE_VALUE); + public String load(@NonNull ConfigProperty key) { + TransientConfigException lastException = null; + for (int attempt = 0; attempt < MAX_TRANSIENT_RETRIES; attempt++) { + try { + return delegate.getConfigPropertyAsOptional(key).orElse(NEGATIVE_VALUE); + } catch (TransientConfigException e) { + lastException = e; + log.log(Level.WARNING, "Transient failure on attempt {0}/{1} for config property {2}", + new Object[] {attempt + 1, MAX_TRANSIENT_RETRIES, key.name()}); + } + try { + if (transientRetryDelayMs > 0) + Thread.sleep(transientRetryDelayMs); + } catch (InterruptedException ie) { + Thread.currentThread().interrupt(); + throw new TransientConfigException("Config load for " + key.name() + " interrupted during retry", ie); + } + } + throw lastException; + } + + @Override + public ListenableFuture reload(@NonNull ConfigProperty key, @NonNull String oldValue) { + try { + String newValue = delegate.getConfigPropertyAsOptional(key).orElse(NEGATIVE_VALUE); + // Fallback heuristic for backends that still
swallow exceptions + // (e.g. GCP SecretManagerConfigService): if the value was valid + // before but now comes back empty, assume transient and retain. + if (NEGATIVE_VALUE.equals(newValue) && !NEGATIVE_VALUE.equals(oldValue)) { + log.log(Level.WARNING, + "Backend returned empty for config property {0} which was previously set; assuming transient failure and retaining cached value", + key.name()); + return Futures.immediateFuture(oldValue); + } + return Futures.immediateFuture(newValue); + } catch (TransientConfigException e) { + // Backend explicitly signalled a transient failure. + // Returning the old value resets the write-time so Guava waits a + // full TTL before retrying, rather than retrying on every request. + log.log(Level.WARNING, + "Transient failure reloading config property {0}; retaining cached value until next refresh cycle", + key.name()); + return Futures.immediateFuture(oldValue); + } } }); } @@ -85,8 +159,22 @@ public Optional getConfigPropertyAsOptional(ConfigProperty property) { } else { return Optional.of(value); } + } catch (UncheckedExecutionException e) { + // Guava wraps RuntimeExceptions from load() in UncheckedExecutionException. + // TransientConfigException is a RuntimeException, so it lands here. + Throwable cause = e.getCause(); + if (cause instanceof TransientConfigException) { + // load() retried MAX_TRANSIENT_RETRIES times and still failed. Nothing was + // cached, so the next request will retry immediately. Re-throw so callers can + // distinguish a transient store outage from a genuinely missing property. + log.log(Level.WARNING, + "Transient backend failure for config property {0}; all retries exhausted", + property.name()); + throw (TransientConfigException) cause; + } + throw (cause instanceof RuntimeException) ? (RuntimeException) cause : e; } catch (ExecutionException e) { - //unwrap if possible, re-throw + // Guava wraps checked exceptions from load() in ExecutionException. 
if (e.getCause() == null) { throw e; } else { diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapper.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapper.java index 3439a96dda..8dd1c0e84c 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapper.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapper.java @@ -33,9 +33,10 @@ public void write(ProcessedContent content) throws WriteFailure { @Override public void write(String key, ProcessedContent content) throws WriteFailure { try { - if (!Objects.equals(COMPRESSION_TYPE, content.getContentEncoding())) { + byte[] rawContent = content.getContent(); + if (!Objects.equals(COMPRESSION_TYPE, content.getContentEncoding())) { log.info("Compressing response with gzip encoding through wrapper"); - byte[] compressedContent = gzipContent(content.getContent()); + byte[] compressedContent = gzipContent(rawContent); content = content.withContentEncoding(COMPRESSION_TYPE).withContent(compressedContent); } delegate.write(key, content); diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtils.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtils.java index f33f70bb6c..0a3c048d54 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtils.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtils.java @@ -3,10 +3,12 @@ import co.worklytics.psoxy.ControlHeader; import co.worklytics.psoxy.gateway.*; import co.worklytics.psoxy.gateway.impl.ApiDataRequestHandler; +import com.google.api.client.http.HttpContent; import com.google.api.client.http.HttpRequest; import com.google.api.client.http.HttpResponse; import lombok.AllArgsConstructor; import lombok.extern.java.Log; +import org.apache.commons.lang3.ArrayUtils; import 
org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.tuple.Pair; import org.apache.http.HttpHeaders; @@ -32,9 +34,9 @@ public class ApiDataOutputUtils { /** * keys for metadata that will be added to API Data output objects. * - * in all of these cases, except API_HOST, the values will be as follows: - * - original (raw) case, will be the ACTUAL value sent with the request to the source API - * - sanitized case, will be the value as it was sent to the proxy, which is presumably sanitized to reduce exposure of sensitive data + * in all of these cases, the values will be as follows: + * - sanitized output: {@link #buildProxyRequestMetadata} from the client request to the proxy + * - raw side output: {@link #buildSourceApiRequestMetadata} from the request sent to the source API */ public enum OutputObjectMetadata { @@ -123,46 +125,76 @@ public ProcessedContent responseAsRawProcessedContent(HttpRequest sourceApiReque } } - Map metadata = this.buildRawMetadata(sourceApiRequest); + builder.metadata(Collections.emptyMap()); + return builder.build(); + } - //not sure this will work; are we certain to be able to consume HttpContent after request has been sent? - if (sourceApiRequest.getContent() != null - && sourceApiRequest.getContent().getLength() > 0) { - // if the request has a body, add it to metadata - try (ByteArrayOutputStream out = new ByteArrayOutputStream()) { - sourceApiRequest.getContent().writeTo(out); - metadata.put(OutputObjectMetadata.REQUEST_BODY.name(), - base64encoder.encodeToString(out.toByteArray())); - } catch (IOException e) { - log.log(Level.WARNING, "Error reading request body to fill in metadata; possibly bc request already sent, so some implementations we cannot re-read the content stream", e); - } - } + /** + * Request metadata from the client request to the proxy, for sanitized output. 
+ */ + public Map buildProxyRequestMetadata(HttpEventRequest requestToProxy) { + Map metadata = new HashMap<>(); - builder.metadata(metadata); + requestToProxy.getHeaders().entrySet().stream() + .filter(entry -> this.isParameterHeader(entry.getKey())) + .forEach(entry -> metadata.put(entry.getKey(), String.join(",", entry.getValue()))); + requestToProxy.getHeader(HttpHeaders.HOST) + .filter(StringUtils::isNotBlank) + .ifPresent(host -> metadata.put(OutputObjectMetadata.API_HOST.name(), host)); + metadata.put(OutputObjectMetadata.HTTP_METHOD.name(), requestToProxy.getHttpMethod()); - return builder.build(); - } + String path = normalizePath(requestToProxy.getPath()); + if (StringUtils.isNotEmpty(path)) { + metadata.put(OutputObjectMetadata.PATH.name(), path); + } - Map buildRawMetadata(HttpRequest sourceApiRequest) { - HashMap metadata = new HashMap<>(); - metadata.put(ApiDataOutputUtils.OutputObjectMetadata.API_HOST.name(), sourceApiRequest.getUrl().getHost()); + requestToProxy.getQuery() + .filter(StringUtils::isNotBlank) + .ifPresent(query -> metadata.put(OutputObjectMetadata.QUERY_STRING.name(), query)); - // split rawPath into path and query string - String path = normalizePath(sourceApiRequest.getUrl().getRawPath()); - if (!StringUtils.isEmpty(path)) { - metadata.put(ApiDataOutputUtils.OutputObjectMetadata.PATH.name(), path); - } + Optional.ofNullable(requestToProxy.getBody()) + .filter(ArrayUtils::isNotEmpty) + .map(base64encoder::encodeToString) + .ifPresent(body -> metadata.put(OutputObjectMetadata.REQUEST_BODY.name(), body)); + + return metadata; + } + + /** + * Request metadata from the request sent to the source API, for raw side output. 
+ */ + public Map buildSourceApiRequestMetadata(HttpRequest sourceApiRequest) { + Map metadata = new HashMap<>(); - Pair splitPathAndQuery = splitPathAndQuery(sourceApiRequest.getUrl().buildRelativeUrl()); + metadata.put(OutputObjectMetadata.API_HOST.name(), sourceApiRequest.getUrl().getHost()); + metadata.put(OutputObjectMetadata.HTTP_METHOD.name(), sourceApiRequest.getRequestMethod()); - if (splitPathAndQuery.getRight() != null) { - metadata.put(ApiDataOutputUtils.OutputObjectMetadata.QUERY_STRING.name(), canonicalQuery(Optional.of(splitPathAndQuery.getRight()))); + Pair splitPathAndQuery = + splitPathAndQuery(sourceApiRequest.getUrl().buildRelativeUrl()); + if (StringUtils.isNotEmpty(splitPathAndQuery.getLeft())) { + metadata.put(OutputObjectMetadata.PATH.name(), splitPathAndQuery.getLeft()); + } + if (StringUtils.isNotBlank(splitPathAndQuery.getRight())) { + metadata.put(OutputObjectMetadata.QUERY_STRING.name(), + canonicalQuery(Optional.of(splitPathAndQuery.getRight()))); } - metadata.put(ApiDataOutputUtils.OutputObjectMetadata.HTTP_METHOD.name(), sourceApiRequest.getRequestMethod()); + //not sure this will work; are we certain to be able to consume HttpContent after request has been sent? 
+ try { + HttpContent content = sourceApiRequest.getContent(); + if (content != null && content.getLength() > 0) { + try (ByteArrayOutputStream out = new ByteArrayOutputStream()) { + content.writeTo(out); + metadata.put(OutputObjectMetadata.REQUEST_BODY.name(), + base64encoder.encodeToString(out.toByteArray())); + } + } + } catch (IOException e) { + log.log(Level.WARNING, "Error reading request body to fill in metadata; possibly because the request was already sent and some implementations cannot re-read the content stream", e); + } return metadata; } @@ -177,37 +209,6 @@ private Pair splitPathAndQuery(String rawPath) { return Pair.of(path, query); } - /** - * builds metadata for output object based on request, which intended for writing to GCS/S3 metadata - * - * (Azure Blob Storage metadata support is more limited, so likely this will not work there) - * - * does NOT enforce platform-specific constraints on metadata keys/values; we leave it to the platform
- * - * @param requestToProxy - * @return - */ - Map buildMetadata(HttpEventRequest requestToProxy) { - - Map metadata = new HashMap<>(); - - requestToProxy.getHeaders().entrySet().stream() - .filter(entry -> this.isParameterHeader(entry.getKey())) - .forEach(entry -> metadata.put(entry.getKey(), String.join(",", entry.getValue()))); - - metadata.put(OutputObjectMetadata.HTTP_METHOD.name(), requestToProxy.getHttpMethod()); - metadata.put(OutputObjectMetadata.PATH.name(), requestToProxy.getPath()); - requestToProxy.getQuery().ifPresent(query -> metadata.put(OutputObjectMetadata.QUERY_STRING.name(), query)); - - Optional.ofNullable(requestToProxy.getBody()) - .map(base64encoder::encodeToString) - .ifPresent(body -> metadata.put(OutputObjectMetadata.REQUEST_BODY.name(), body)); - - return metadata; - } - - final static Set HEADERS_TO_IGNORE = Set.of( HttpHeaders.HOST, HttpHeaders.USER_AGENT, diff --git a/java/core/src/main/java/co/worklytics/psoxy/gateway/output/OutputToApiDataSideOutputAdapter.java b/java/core/src/main/java/co/worklytics/psoxy/gateway/output/OutputToApiDataSideOutputAdapter.java index 71c8228260..1178f1e1ad 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/gateway/output/OutputToApiDataSideOutputAdapter.java +++ b/java/core/src/main/java/co/worklytics/psoxy/gateway/output/OutputToApiDataSideOutputAdapter.java @@ -33,9 +33,6 @@ public void writeRaw(ProcessedContent content, public void writeSanitized(ProcessedContent sanitizedContent, ApiDataRequestHandler.ProcessingContext processingContext) throws IOException { String key = apiDataOutputUtils.buildSanitizedOutputKey(processingContext); - - // TODO: enforce no sensitive data in sanitized output metadata ?? 
- wrappedOutput.write(key, sanitizedContent); } diff --git a/java/core/src/main/java/co/worklytics/psoxy/storage/StorageHandler.java b/java/core/src/main/java/co/worklytics/psoxy/storage/StorageHandler.java index 3c200c8d3f..575bbcf6dd 100644 --- a/java/core/src/main/java/co/worklytics/psoxy/storage/StorageHandler.java +++ b/java/core/src/main/java/co/worklytics/psoxy/storage/StorageHandler.java @@ -19,7 +19,6 @@ import java.util.Map; import java.util.Objects; import java.util.Optional; -import java.util.Set; import java.util.function.Supplier; import java.util.stream.Collectors; import java.util.zip.GZIPInputStream; @@ -38,6 +37,7 @@ import com.google.common.annotations.VisibleForTesting; import co.worklytics.psoxy.Pseudonymizer; import co.worklytics.psoxy.gateway.BulkModeConfigProperty; +import co.worklytics.psoxy.gateway.BulkContentTypes; import co.worklytics.psoxy.gateway.ConfigService; import co.worklytics.psoxy.gateway.HostEnvironment; import co.worklytics.psoxy.gateway.ProxyConfigProperty; @@ -70,34 +70,6 @@ public class StorageHandler { public static final String CONTENT_ENCODING_GZIP = "gzip"; public static final String EXTENSION_GZIP = ".gz"; - private static final String CONTENT_TYPE_FORM_URLENCODED = "application/x-www-form-urlencoded"; - private static final String CONTENT_TYPE_CSV = "text/csv"; - private static final String CONTENT_TYPE_APPLICATION_CSV = "application/csv"; - private static final String CONTENT_TYPE_JSON = "application/json"; - private static final String CONTENT_TYPE_NDJSON = "application/x-ndjson"; - private static final String CONTENT_TYPE_NDJSON_ALT = "application/ndjson"; - private static final String CONTENT_TYPE_PARQUET = "application/vnd.apache.parquet"; - private static final String CONTENT_TYPE_CSV_UTF8 = CONTENT_TYPE_CSV + "; charset=utf-8"; - - /** - * Well-defined Content-Types that cloud consoles often attach to bulk uploads, but that do not - * reflect the file format (e.g. 
AWS S3 console may use {@code application/x-www-form-urlencoded} - * for arbitrary objects). When the object path implies a bulk format, extension-inferred types - * are preferred over these values. - */ - private static final Set KNOWN_GENERIC_CONTENT_TYPES = Set.of( - CONTENT_TYPE_FORM_URLENCODED - ); - - private static final Set SUPPORTED_BULK_CONTENT_TYPE_BASES = Set.of( - CONTENT_TYPE_CSV, - CONTENT_TYPE_APPLICATION_CSV, - CONTENT_TYPE_JSON, - CONTENT_TYPE_NDJSON, - CONTENT_TYPE_NDJSON_ALT, - CONTENT_TYPE_PARQUET - ); - // gzip magic number bytes (RFC 1952) private static final int GZIP_MAGIC_BYTE_1 = 0x1f; private static final int GZIP_MAGIC_BYTE_2 = 0x8b; @@ -151,7 +123,7 @@ static void warnIfEncodingDoesNotMatchFilename(@NonNull StorageEventRequest requ *

<ol>
 * <li>Use object metadata when it is a supported bulk type (including parameters such as
 * {@code charset}).</li>
- * <li>When metadata is absent, or is a {@link #KNOWN_GENERIC_CONTENT_TYPES known generic}
+ * <li>When metadata is absent, or is a {@link BulkContentTypes#KNOWN_GENERIC_UPLOAD_TYPES known generic}
 * upload type, prefer a type inferred from the file extension when recognized.</li>
 * <li>Otherwise use the metadata value when present.</li>
 * </ol>
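Reviewer note: the three-step resolution order in the Javadoc above can be sketched as a standalone class. This is a minimal illustration, not the code in `StorageHandler`. The class name `ContentTypeResolutionSketch`, the two sets, and the helper `inferFromPath` are hypothetical stand-ins for the `BulkContentTypes` constants and private helpers referenced in this diff, and returning `null` when nothing is known is an assumption.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

public class ContentTypeResolutionSketch {

    // Hypothetical stand-ins for BulkContentTypes.SUPPORTED_BULK_BASES and
    // BulkContentTypes.KNOWN_GENERIC_UPLOAD_TYPES; values mirror the constants removed in this diff.
    static final Set<String> SUPPORTED_BULK_BASES = Set.of(
        "text/csv", "application/csv", "application/json",
        "application/x-ndjson", "application/ndjson", "application/vnd.apache.parquet");
    static final Set<String> KNOWN_GENERIC_UPLOAD_TYPES =
        Set.of("application/x-www-form-urlencoded");

    // Strips parameters such as "; charset=utf-8" and lowercases the base type.
    static String baseContentType(String contentType) {
        if (contentType == null || contentType.isBlank()) {
            return null;
        }
        return contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    }

    // Infers a content type from the file extension, if it is a recognized one.
    static Optional<String> inferFromPath(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith(".ndjson") || p.endsWith(".jsonl")) {
            return Optional.of("application/x-ndjson");
        } else if (p.endsWith(".csv")) {
            return Optional.of("text/csv; charset=utf-8");
        } else if (p.endsWith(".json")) {
            return Optional.of("application/json");
        } else if (p.endsWith(".parquet")) {
            return Optional.of("application/vnd.apache.parquet");
        }
        return Optional.empty();
    }

    static String effectiveContentType(String path, String metadataContentType) {
        String base = baseContentType(metadataContentType);
        // 1. A supported bulk type in object metadata wins, parameters included.
        if (base != null && SUPPORTED_BULK_BASES.contains(base)) {
            return metadataContentType;
        }
        // 2. Absent or known-generic metadata: prefer the extension-inferred type.
        if (base == null || KNOWN_GENERIC_UPLOAD_TYPES.contains(base)) {
            Optional<String> inferred = inferFromPath(path);
            if (inferred.isPresent()) {
                return inferred.get();
            }
        }
        // 3. Otherwise use the metadata value (assumption: null when nothing is known).
        return metadataContentType;
    }

    public static void main(String[] args) {
        System.out.println(effectiveContentType("export.csv", "application/x-www-form-urlencoded"));
        System.out.println(effectiveContentType("events.jsonl", null));
        System.out.println(effectiveContentType("export.bin", "application/octet-stream"));
    }
}
```

The ordering means a misleading console-assigned type such as `application/x-www-form-urlencoded` never overrides a recognizable extension, while an explicit supported type in metadata is always respected.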
@@ -183,12 +155,12 @@ String effectiveContentType(@NonNull String sourceObjectPath, @Nullable String s private boolean isKnownGenericContentType(@Nullable String contentType) { String base = baseContentType(contentType); - return base != null && KNOWN_GENERIC_CONTENT_TYPES.contains(base); + return base != null && BulkContentTypes.KNOWN_GENERIC_UPLOAD_TYPES.contains(base); } private boolean matchesSupportedBulkContentType(@Nullable String contentType) { String base = baseContentType(contentType); - return base != null && SUPPORTED_BULK_CONTENT_TYPE_BASES.contains(base); + return base != null && BulkContentTypes.SUPPORTED_BULK_BASES.contains(base); } @Nullable @@ -207,13 +179,13 @@ private Optional inferContentTypeFromObjectPath(@NonNull String sourceOb String inferredContentType = null; if (path.endsWith(".ndjson") || path.endsWith(".jsonl")) { - inferredContentType = CONTENT_TYPE_NDJSON; + inferredContentType = BulkContentTypes.NDJSON.getMimeType(); } else if (path.endsWith(".csv")) { - inferredContentType = CONTENT_TYPE_CSV_UTF8; + inferredContentType = BulkContentTypes.CSV_UTF8; } else if (path.endsWith(".json")) { - inferredContentType = CONTENT_TYPE_JSON; + inferredContentType = BulkContentTypes.JSON.getMimeType(); } else if (path.endsWith(".parquet")) { - inferredContentType = CONTENT_TYPE_PARQUET; + inferredContentType = BulkContentTypes.PARQUET.getMimeType(); } return Optional.ofNullable(inferredContentType); } diff --git a/java/core/src/test/java/co/worklytics/psoxy/gateway/BulkContentTypesTest.java b/java/core/src/test/java/co/worklytics/psoxy/gateway/BulkContentTypesTest.java new file mode 100644 index 0000000000..25f7438664 --- /dev/null +++ b/java/core/src/test/java/co/worklytics/psoxy/gateway/BulkContentTypesTest.java @@ -0,0 +1,18 @@ +package co.worklytics.psoxy.gateway; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +import java.util.Set; +import org.junit.jupiter.api.Test; + +class BulkContentTypesTest { + + @Test + void 
describeContentTypes_sortsForStableOutput() { + assertEquals( + "application/json, application/x-ndjson", + BulkContentTypes.describeContentTypes(Set.of( + BulkContentTypes.NDJSON.getMimeType(), + BulkContentTypes.JSON.getMimeType()))); + } +} diff --git a/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandlerTest.java b/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandlerTest.java index 476fae3585..025f207b52 100644 --- a/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandlerTest.java +++ b/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandlerTest.java @@ -2,6 +2,7 @@ import co.worklytics.psoxy.ConfigRulesModule; import co.worklytics.psoxy.ControlHeader; +import co.worklytics.psoxy.ProcessedDataMetadataFields; import co.worklytics.psoxy.Pseudonymizer; import co.worklytics.psoxy.PseudonymizerImplFactory; import co.worklytics.psoxy.PsoxyModule; @@ -11,6 +12,7 @@ import co.worklytics.psoxy.gateway.HttpEventRequest; import co.worklytics.psoxy.gateway.HttpEventResponse; import co.worklytics.psoxy.gateway.ProxyConfigProperty; +import co.worklytics.psoxy.gateway.output.ApiDataOutputUtils; import co.worklytics.psoxy.impl.RESTApiSanitizerImpl; import co.worklytics.psoxy.rules.RESTRules; import co.worklytics.psoxy.rules.RulesUtils; @@ -614,6 +616,32 @@ void getSanitizerForRequest() { .getPseudonymImplementation()); } + @Test + void sanitizedApiResponseMetadata_includesOnlyProcessedDataFields() { + Map metadata = Map.of( + ProcessedDataMetadataFields.RULES_SHA.getMetadataKey(), "abc123", + ProcessedDataMetadataFields.PROXY_VERSION.getMetadataKey(), "v0.6.7", + ApiDataOutputUtils.OutputObjectMetadata.REQUEST_BODY.name(), "base64body", + ApiDataOutputUtils.OutputObjectMetadata.QUERY_STRING.name(), "foo=bar", + ApiDataOutputUtils.OutputObjectMetadata.PATH.name(), "users/me" + ); + + Map responseMetadata = + ApiDataRequestHandler.sanitizedApiResponseMetadata(metadata); + + 
assertEquals("abc123", + responseMetadata.get(ProcessedDataMetadataFields.RULES_SHA.getHttpHeader())); + assertEquals("v0.6.7", + responseMetadata.get(ProcessedDataMetadataFields.PROXY_VERSION.getHttpHeader())); + assertEquals(2, responseMetadata.size()); + assertFalse(responseMetadata.containsKey( + ApiDataOutputUtils.OutputObjectMetadata.REQUEST_BODY.name())); + assertFalse(responseMetadata.containsKey( + ApiDataOutputUtils.OutputObjectMetadata.QUERY_STRING.name())); + assertFalse(responseMetadata.containsKey( + ApiDataOutputUtils.OutputObjectMetadata.PATH.name())); + } + @Test void testHeadersPassThrough() throws IOException { setup("gmail", "google.apis.com"); diff --git a/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecoratorTest.java b/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecoratorTest.java index 3f8c00e075..705afcb8c8 100644 --- a/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecoratorTest.java +++ b/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecoratorTest.java @@ -13,6 +13,10 @@ import java.util.Map; import java.util.NoSuchElementException; import java.util.Optional; +import java.util.concurrent.TimeUnit; + +import com.google.common.base.Ticker; +import co.worklytics.psoxy.gateway.TransientConfigException; import static org.junit.jupiter.api.Assertions.*; @@ -72,6 +76,98 @@ void putConfigProperty() { localHashMapConfigService.getConfigPropertyOrError(TestConfigProperties.EXAMPLE_PROPERTY)); } + @Test + void retainsStaleValueOnTransientReloadFailure() { + FakeTicker ticker = new FakeTicker(); + ToggleableConfigService delegate = new ToggleableConfigService(); + delegate.putConfigProperty(TestConfigProperties.EXAMPLE_PROPERTY, "valid_token"); + + CachingConfigServiceDecorator cache = + new CachingConfigServiceDecorator(delegate, Duration.ofMinutes(1), ticker); + + // initial load succeeds + 
assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(1, delegate.getReads()); + + // simulate transient SSM failure and advance past TTL + delegate.setSimulateFailure(true); + ticker.advance(2, TimeUnit.MINUTES); + + // reload fails silently; old value is retained — caller sees no error + assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(2, delegate.getReads()); // reload was attempted + + // SSM recovers; advance past TTL again + delegate.setSimulateFailure(false); + ticker.advance(2, TimeUnit.MINUTES); + + // next reload succeeds; value still valid + assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(3, delegate.getReads()); + } + + @Test + void retainsStaleValueOnExplicitTransientException() { + FakeTicker ticker = new FakeTicker(); + ThrowingConfigService delegate = new ThrowingConfigService("valid_token"); + + CachingConfigServiceDecorator cache = + new CachingConfigServiceDecorator(delegate, Duration.ofMinutes(1), ticker); + + // initial load succeeds + assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(1, delegate.getReads()); + + // backend starts throwing TransientConfigException + delegate.setThrowTransient(true); + ticker.advance(2, TimeUnit.MINUTES); + + // reload catches TransientConfigException; old value retained, no error to caller + assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(2, delegate.getReads()); + + // backend recovers + delegate.setThrowTransient(false); + ticker.advance(2, TimeUnit.MINUTES); + + assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(3, 
delegate.getReads()); + } + + @Test + void transientExceptionOnColdStartDoesNotCacheNegativeValue() { + FakeTicker ticker = new FakeTicker(); + ThrowingConfigService delegate = new ThrowingConfigService("valid_token"); + delegate.setThrowTransient(true); + + // 0ms delay so the retry loop is fast in tests + CachingConfigServiceDecorator cache = + new CachingConfigServiceDecorator(delegate, Duration.ofMinutes(1), ticker, 0L); + + // cold start with transient error: retries MAX_TRANSIENT_RETRIES times then throws — + // nothing is cached as NEGATIVE_VALUE, so the next request will retry immediately + assertThrows(TransientConfigException.class, + () -> cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(CachingConfigServiceDecorator.MAX_TRANSIENT_RETRIES, delegate.getReads()); + + // second request: still failing, retries again from scratch (nothing was cached) + assertThrows(TransientConfigException.class, + () -> cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(CachingConfigServiceDecorator.MAX_TRANSIENT_RETRIES * 2, delegate.getReads()); + + // backend recovers — no TTL advance needed since nothing was ever cached + delegate.setThrowTransient(false); + assertEquals(Optional.of("valid_token"), + cache.getConfigPropertyAsOptional(TestConfigProperties.EXAMPLE_PROPERTY)); + assertEquals(CachingConfigServiceDecorator.MAX_TRANSIENT_RETRIES * 2 + 1, delegate.getReads()); + } + @Test void getConfigProperty_noCache() { assertTrue(config.getConfigPropertyAsOptional(TestConfigProperties.NO_CACHE).isEmpty()); @@ -91,6 +187,88 @@ void getConfigProperty_noCache() { } + static class FakeTicker extends Ticker { + private long nanos = 0; + + @Override + public long read() { + return nanos; + } + + void advance(long amount, TimeUnit unit) { + nanos += unit.toNanos(amount); + } + } + + static class ThrowingConfigService implements WritableConfigService { + private String value; + private boolean 
throwTransient = false; + + @Getter + private int reads = 0; + + ThrowingConfigService(String value) { + this.value = value; + } + + void setThrowTransient(boolean throwTransient) { + this.throwTransient = throwTransient; + } + + @Override + public void putConfigProperty(ConfigProperty property, String newValue) { + this.value = newValue; + } + + @Override + public String getConfigPropertyOrError(ConfigProperty property) { + return getConfigPropertyAsOptional(property) + .orElseThrow(() -> new NoSuchElementException("no value for " + property)); + } + + @Override + public Optional getConfigPropertyAsOptional(ConfigProperty property) { + reads++; + if (throwTransient) { + throw new TransientConfigException("simulated transient failure"); + } + return Optional.ofNullable(value); + } + } + + static class ToggleableConfigService implements WritableConfigService { + private final Map map = new HashMap<>(); + + @Getter + private int reads = 0; + + private boolean simulateFailure = false; + + void setSimulateFailure(boolean simulateFailure) { + this.simulateFailure = simulateFailure; + } + + @Override + public void putConfigProperty(ConfigProperty property, String value) { + map.put(property, value); + } + + @Override + public String getConfigPropertyOrError(ConfigProperty property) { + return getConfigPropertyAsOptional(property) + .orElseThrow(() -> new NoSuchElementException("no value for " + property)); + } + + @Override + public Optional getConfigPropertyAsOptional(ConfigProperty property) { + reads++; + if (simulateFailure) { + return Optional.empty(); + } + return Optional.ofNullable(map.get(property)); + } + } + static class LocalHashMapConfigService implements WritableConfigService { Map map = new HashMap<>(); diff --git a/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapperTest.java b/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapperTest.java index dfc0fab064..84f95fc208 100644 --- 
a/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapperTest.java +++ b/java/core/src/test/java/co/worklytics/psoxy/gateway/impl/output/CompressedOutputWrapperTest.java @@ -1,5 +1,8 @@ package co.worklytics.psoxy.gateway.impl.output; +import co.worklytics.psoxy.gateway.ProcessedContent; +import co.worklytics.psoxy.gateway.output.Output; + import org.junit.jupiter.api.Test; import java.io.BufferedReader; @@ -13,6 +16,25 @@ class CompressedOutputWrapperTest { + @Test + void writeNullContent() throws Exception { + Output delegate = new NoOutput(); + CompressedOutputWrapper wrapper = CompressedOutputWrapper.wrap(delegate); + ProcessedContent content = ProcessedContent.builder().content(null).build(); + assertDoesNotThrow(() -> wrapper.write(content)); + } + + @Test + void writeNullContentAlreadyGzipEncoded() throws Exception { + Output delegate = new NoOutput(); + CompressedOutputWrapper wrapper = CompressedOutputWrapper.wrap(delegate); + ProcessedContent content = ProcessedContent.builder() + .content(null) + .contentEncoding(CompressedOutputWrapper.COMPRESSION_TYPE) + .build(); + assertDoesNotThrow(() -> wrapper.write(content)); + } + @Test void gzipContent() throws Exception { String content = "Hello, world!"; diff --git a/java/core/src/test/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtilsTest.java b/java/core/src/test/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtilsTest.java index 0079c21b3e..bfb68f160b 100644 --- a/java/core/src/test/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtilsTest.java +++ b/java/core/src/test/java/co/worklytics/psoxy/gateway/output/ApiDataOutputUtilsTest.java @@ -11,8 +11,10 @@ import static org.junit.jupiter.api.Assertions.*; import static org.mockito.Mockito.*; +import static co.worklytics.psoxy.gateway.output.ApiDataOutputUtils.OutputObjectMetadata.*; import co.worklytics.psoxy.gateway.HttpEventRequest; import co.worklytics.psoxy.gateway.ProcessedContent; +import 
com.google.api.client.http.GenericUrl; import com.google.api.client.http.HttpRequest; import com.google.api.client.http.HttpResponse; @@ -21,6 +23,7 @@ import java.nio.charset.StandardCharsets; import java.time.Clock; import java.time.Instant; +import java.time.ZoneOffset; import java.util.*; class ApiDataOutputUtilsTest { @@ -32,7 +35,7 @@ class ApiDataOutputUtilsTest { public void setup() { utils = new ApiDataOutputUtils(mock(ApiModeConfig.class), mock(ConfigService.class), () -> UUID.fromString("123e4567-e89b-12d3-a456-426614174000"), Base64.getEncoder()); - clock = Clock.fixed(Instant.parse("2024-10-01T10:15:30Z"), java.time.ZoneOffset.UTC); + clock = Clock.fixed(Instant.parse("2024-10-01T10:15:30Z"), ZoneOffset.UTC); } @@ -67,7 +70,7 @@ void testBuildSanitizedOutputKey() { @Test void responseAsRawProcessedContent() throws Exception { HttpRequest mockRequest = mock(HttpRequest.class); - when(mockRequest.getUrl()).thenReturn(new com.google.api.client.http.GenericUrl("https://api.example.com/v1/resource")); + when(mockRequest.getUrl()).thenReturn(new GenericUrl("https://api.example.com/v1/resource")); when(mockRequest.getRequestMethod()).thenReturn("GET"); HttpResponse mockResponse = mock(HttpResponse.class); when(mockResponse.getContentType()).thenReturn("application/json"); @@ -78,42 +81,50 @@ void responseAsRawProcessedContent() throws Exception { ProcessedContent processed = utils.responseAsRawProcessedContent(mockRequest, mockResponse); assertEquals("application/json", processed.getContentType()); assertArrayEquals(contentBytes, processed.getContent()); + assertTrue(processed.getMetadata().isEmpty()); } @Test - void buildRawMetadata() { + void buildSourceApiRequestMetadata() { HttpRequest mockRequest = mock(HttpRequest.class); - com.google.api.client.http.GenericUrl url = new com.google.api.client.http.GenericUrl("https://api.example.com/v1/resource?foo=bar"); + GenericUrl url = new GenericUrl("https://api.example.com/v1/resource?foo=bar"); 
when(mockRequest.getUrl()).thenReturn(url); when(mockRequest.getRequestMethod()).thenReturn("POST"); - Map<String, String> metadata = utils.buildRawMetadata(mockRequest); - assertEquals("api.example.com", metadata.get("API_HOST")); - assertEquals("v1/resource", metadata.get("PATH")); - assertEquals("POST", metadata.get("HTTP_METHOD")); - assertTrue(metadata.get(ApiDataOutputUtils.OutputObjectMetadata.QUERY_STRING.name()).contains("foo=bar")); + Map<String, String> metadata = utils.buildSourceApiRequestMetadata(mockRequest); + assertEquals("api.example.com", metadata.get(API_HOST.name())); + assertEquals("v1/resource", metadata.get(PATH.name())); + assertEquals("POST", metadata.get(HTTP_METHOD.name())); + assertTrue(metadata.get(QUERY_STRING.name()).contains("foo=bar")); } @Test - void buildMetadata() { + void buildProxyRequestMetadata() { HttpEventRequest mockRequest = MockModules.provideMock(HttpEventRequest.class); Map<String, List<String>> headers = new HashMap<>(); headers.put("Authorization", List.of("Bearer token")); + headers.put("Host", List.of("proxy.example.com")); when(mockRequest.getHeaders()).thenReturn(headers); + when(mockRequest.getHeader("Host")).thenReturn(Optional.of("proxy.example.com")); when(mockRequest.getHttpMethod()).thenReturn("GET"); when(mockRequest.getPath()).thenReturn("v1/resource"); - when(mockRequest.getQuery()).thenReturn(Optional.of("foo=bar")); - when(mockRequest.getBody()).thenReturn("body".getBytes()); - Map<String, String> metadata = utils.buildMetadata(mockRequest); - assertEquals("GET", metadata.get("HTTP_METHOD")); - assertEquals("v1/resource", metadata.get("PATH")); - assertTrue(metadata.get("QUERY_STRING").contains("foo=bar")); - assertNotNull(metadata.get("REQUEST_BODY")); + when(mockRequest.getQuery()).thenReturn(Optional.of("email=tokenized-value")); + when(mockRequest.getBody()).thenReturn("{\"email\":\"tokenized-value\"}".getBytes(StandardCharsets.UTF_8)); + + Map<String, String> metadata = utils.buildProxyRequestMetadata(mockRequest); + + assertEquals("proxy.example.com",
metadata.get(API_HOST.name())); + assertEquals("GET", metadata.get(HTTP_METHOD.name())); + assertEquals("v1/resource", metadata.get(PATH.name())); + assertEquals("email=tokenized-value", metadata.get(QUERY_STRING.name())); + assertEquals(Base64.getEncoder().encodeToString( + "{\"email\":\"tokenized-value\"}".getBytes(StandardCharsets.UTF_8)), + metadata.get(REQUEST_BODY.name())); } @Test void responseAsRawProcessedContent_nullCharsetDoesNotThrow() throws Exception { HttpRequest mockRequest = mock(HttpRequest.class); - when(mockRequest.getUrl()).thenReturn(new com.google.api.client.http.GenericUrl("https://api.example.com/v1/resource")); + when(mockRequest.getUrl()).thenReturn(new GenericUrl("https://api.example.com/v1/resource")); when(mockRequest.getRequestMethod()).thenReturn("GET"); HttpResponse mockResponse = mock(HttpResponse.class); when(mockResponse.getContentType()).thenReturn("application/json"); @@ -130,7 +141,7 @@ void responseAsRawProcessedContent_nullCharsetDoesNotThrow() throws Exception { @Test void responseAsRawProcessedContent_nullContentDoesNotThrow() throws Exception { HttpRequest mockRequest = mock(HttpRequest.class); - when(mockRequest.getUrl()).thenReturn(new com.google.api.client.http.GenericUrl("https://api.example.com/v1/resource")); + when(mockRequest.getUrl()).thenReturn(new GenericUrl("https://api.example.com/v1/resource")); when(mockRequest.getRequestMethod()).thenReturn("GET"); HttpResponse mockResponse = mock(HttpResponse.class); when(mockResponse.getContentType()).thenReturn(null); diff --git a/java/core/src/test/java/co/worklytics/psoxy/storage/StorageHandlerTest.java b/java/core/src/test/java/co/worklytics/psoxy/storage/StorageHandlerTest.java index b80030a107..174fada09c 100644 --- a/java/core/src/test/java/co/worklytics/psoxy/storage/StorageHandlerTest.java +++ b/java/core/src/test/java/co/worklytics/psoxy/storage/StorageHandlerTest.java @@ -34,6 +34,7 @@ import com.fasterxml.jackson.databind.ObjectMapper; import 
com.google.common.collect.ImmutableMap; import co.worklytics.psoxy.PsoxyModule; +import co.worklytics.psoxy.gateway.BulkContentTypes; import co.worklytics.psoxy.gateway.BulkModeConfigProperty; import co.worklytics.psoxy.gateway.ConfigService; import co.worklytics.psoxy.gateway.ProxyConfigProperty; @@ -439,29 +440,29 @@ void buildRequest_replacesGenericContentTypeWithInferred() { "items.ndjson", handler.buildDefaultTransform(), null, - "application/x-www-form-urlencoded"); + BulkContentTypes.FORM_URLENCODED.getMimeType()); - assertEquals("application/x-ndjson", request.getContentType()); + assertEquals(BulkContentTypes.NDJSON.getMimeType(), request.getContentType()); } @Test void effectiveContentType_replacesGenericWithInferred() { assertEquals( - "application/x-ndjson", - handler.effectiveContentType("items.ndjson", "application/x-www-form-urlencoded")); + BulkContentTypes.NDJSON.getMimeType(), + handler.effectiveContentType("items.ndjson", BulkContentTypes.FORM_URLENCODED.getMimeType())); assertEquals( - "text/csv; charset=utf-8", - handler.effectiveContentType("data.csv", "application/x-www-form-urlencoded")); + BulkContentTypes.CSV_UTF8, + handler.effectiveContentType("data.csv", BulkContentTypes.FORM_URLENCODED.getMimeType())); assertEquals( - "application/json", - handler.effectiveContentType("export/file.json", "application/x-www-form-urlencoded")); + BulkContentTypes.JSON.getMimeType(), + handler.effectiveContentType("export/file.json", BulkContentTypes.FORM_URLENCODED.getMimeType())); } @Test void effectiveContentType_preservesSupportedSourceType() { assertEquals( - "text/csv; charset=us-ascii", - handler.effectiveContentType("data.csv", "text/csv; charset=us-ascii")); + BulkContentTypes.CSV.getMimeType() + "; charset=us-ascii", + handler.effectiveContentType("data.csv", BulkContentTypes.CSV.getMimeType() + "; charset=us-ascii")); } @Test @@ -474,7 +475,40 @@ void effectiveContentType_preservesNonGenericNonSupportedSourceType() { @Test void 
effectiveContentType_infersWhenAbsent() { assertEquals( - "application/x-ndjson", + BulkContentTypes.NDJSON.getMimeType(), handler.effectiveContentType("items.ndjson", null)); + assertEquals( + BulkContentTypes.NDJSON.getMimeType(), + handler.effectiveContentType("items.jsonl", null)); + } + + @Test + void effectiveContentType_preservesJsonLinesContentType() { + assertEquals( + BulkContentTypes.JSONLINES.getMimeType(), + handler.effectiveContentType("items.jsonl", BulkContentTypes.JSONLINES.getMimeType())); + assertEquals( + BulkContentTypes.JSONLINES_ALT.getMimeType(), + handler.effectiveContentType("export/file.jsonl", BulkContentTypes.JSONLINES_ALT.getMimeType())); + assertEquals( + BulkContentTypes.JSONL.getMimeType(), + handler.effectiveContentType("items.jsonl", BulkContentTypes.JSONL.getMimeType())); + } + + @Test + void buildRequest_replacesGenericContentTypeWithInferredForJsonl() { + when(config.getConfigPropertyAsOptional(eq(BulkModeConfigProperty.INPUT_BASE_PATH))) + .thenReturn(Optional.empty()); + when(config.getConfigPropertyAsOptional(eq(BulkModeConfigProperty.OUTPUT_BASE_PATH))) + .thenReturn(Optional.empty()); + + StorageEventRequest request = handler.buildRequest( + "bucket-in", + "items.jsonl", + handler.buildDefaultTransform(), + null, + BulkContentTypes.FORM_URLENCODED.getMimeType()); + + assertEquals(BulkContentTypes.NDJSON.getMimeType(), request.getContentType()); } } diff --git a/java/core/src/test/java/co/worklytics/psoxy/storage/impl/RecordBulkDataSanitizerImplTest.java b/java/core/src/test/java/co/worklytics/psoxy/storage/impl/RecordBulkDataSanitizerImplTest.java index a294ca2438..54dd84c42f 100644 --- a/java/core/src/test/java/co/worklytics/psoxy/storage/impl/RecordBulkDataSanitizerImplTest.java +++ b/java/core/src/test/java/co/worklytics/psoxy/storage/impl/RecordBulkDataSanitizerImplTest.java @@ -346,6 +346,43 @@ void testAutoFormat_NdjsonWithUnreliableContentTypeUsesFileExtension() { assertEquals(new 
String(TestUtils.getData("bulk/example-sanitized.ndjson"), StandardCharsets.UTF_8), output); } + @Test + void testAutoFormat_JsonlWithUnreliableContentTypeUsesFileExtension() { + this.setUpWithRules("---\n" + + "format: \"AUTO\"\n" + + "transforms:\n" + + "- redact: \"foo\"\n" + + "- pseudonymize: \"bar\"\n"); + + final String objectPath = "export-20231128/items.jsonl"; + storageHandler.handle(BulkDataTestUtils.request(objectPath) + .withContentType("application/x-www-form-urlencoded"), + BulkDataTestUtils.transform(rules), + BulkDataTestUtils.inputStreamSupplier("bulk/example.ndjson"), + outputStreamSupplier); + + String output = new String(outputStream.toByteArray(), StandardCharsets.UTF_8); + assertEquals(new String(TestUtils.getData("bulk/example-sanitized.ndjson"), StandardCharsets.UTF_8), output); + } + + @Test + void testAutoFormat_JsonlContentType() { + this.setUpWithRules("---\n" + + "format: \"AUTO\"\n" + + "transforms:\n" + + "- redact: \"foo\"\n" + + "- pseudonymize: \"bar\"\n"); + + storageHandler.handle(BulkDataTestUtils.request("export-20231128/items.jsonl") + .withContentType("application/jsonlines"), + BulkDataTestUtils.transform(rules), + BulkDataTestUtils.inputStreamSupplier("bulk/example.ndjson"), + outputStreamSupplier); + + String output = new String(outputStream.toByteArray(), StandardCharsets.UTF_8); + assertEquals(new String(TestUtils.getData("bulk/example-sanitized.ndjson"), StandardCharsets.UTF_8), output); + } + @Test void testAutoFormat_JsonArrayWithoutContentTypeUsesFileExtension() { this.setUpWithRules("---\n" + diff --git a/java/gateway-core/src/test/java/com/avaulta/gateway/rules/PathTemplateUtilsTest.java b/java/gateway-core/src/test/java/com/avaulta/gateway/rules/PathTemplateUtilsTest.java index 3e6916554e..b6e268d88b 100644 --- a/java/gateway-core/src/test/java/com/avaulta/gateway/rules/PathTemplateUtilsTest.java +++ b/java/gateway-core/src/test/java/com/avaulta/gateway/rules/PathTemplateUtilsTest.java @@ -96,6 +96,10 @@ public 
void capturesOptionalCorrectly(String template, String path, String expec "{exportId}/events{shardIndex}.ndjson{suffix?},export123/events0-1775063099227.ndjson,true", // matches .ndjson.gz "{exportId}/events{shardIndex}.ndjson{suffix?},export123/events0-1775019339023.ndjson.gz,true", + // matches .jsonl (JSON Lines equivalent) + "{exportId}/events{shardIndex}.jsonl{suffix?},export123/events0-1775063099227.jsonl,true", + // matches .jsonl.gz + "{exportId}/events{shardIndex}.jsonl{suffix?},export123/events0-1775019339023.jsonl.gz,true", // doesn't match wrong base name "{exportId}/events{shardIndex}.ndjson{suffix?},export123/items0.ndjson,false", }) diff --git a/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/AwsExceptionUtils.java b/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/AwsExceptionUtils.java new file mode 100644 index 0000000000..a23e259934 --- /dev/null +++ b/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/AwsExceptionUtils.java @@ -0,0 +1,14 @@ +package co.worklytics.psoxy.aws; + +import software.amazon.awssdk.awscore.exception.AwsServiceException; + +class AwsExceptionUtils { + + static boolean isAccessDenied(AwsServiceException e) { + if (e.awsErrorDetails() == null) { + return false; + } + String code = e.awsErrorDetails().errorCode(); + return code != null && (code.contains("AccessDenied") || code.contains("Forbidden")); + } +} diff --git a/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/ParameterStoreConfigService.java b/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/ParameterStoreConfigService.java index f6ce84d7e9..eef806fb9f 100644 --- a/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/ParameterStoreConfigService.java +++ b/java/impl/aws/src/main/java/co/worklytics/psoxy/aws/ParameterStoreConfigService.java @@ -3,6 +3,7 @@ import co.worklytics.psoxy.gateway.ConfigService; import co.worklytics.psoxy.gateway.LockService; import co.worklytics.psoxy.gateway.SecretStore; +import 
co.worklytics.psoxy.gateway.TransientConfigException; import co.worklytics.psoxy.gateway.impl.EnvVarsConfigService; import co.worklytics.psoxy.utils.DevLogUtils; import co.worklytics.psoxy.utils.RandomNumberGenerator; @@ -137,11 +138,17 @@ Optional getConfigPropertyAsOptional(ConfigProperty property, Function Optional getConfigPropertyAsOptional(ConfigProperty property, Function Optional getConfigPropertyAsOptional(ConfigProperty property, Function getQuery() { - - Stream<Pair<String, String>> paramStream = Stream.concat( - Optional.ofNullable(event.getQueryStringParameters()) + // API Gateway payload 1.0 populates both queryStringParameters and multiValueQueryStringParameters + // with the same parameters; prefer multiValue to preserve repeated keys without duplication. + Stream<Pair<String, String>> paramStream; + if (event.getMultiValueQueryStringParameters() == null + || event.getMultiValueQueryStringParameters().isEmpty()) { + paramStream = Optional.ofNullable(event.getQueryStringParameters()) .map(params -> params.entrySet().stream() .map(k -> Pair.of(k.getKey(), k.getValue()))) - .orElse(Stream.empty()), - Optional.ofNullable(event.getMultiValueQueryStringParameters()) - .map(params -> params.entrySet().stream() - .flatMap(k -> k.getValue().stream().map(v -> Pair.of(k.getKey(), v)))) - .orElse(Stream.empty()) - ); + .orElse(Stream.empty()); + } else { + paramStream = event.getMultiValueQueryStringParameters().entrySet().stream() + .flatMap(entry -> Optional.ofNullable(entry.getValue()).orElse(List.of()).stream() + .map(v -> Pair.of(entry.getKey(), v))); + } return Optional.ofNullable(StringUtils.trimToNull(paramStream - .map(pair -> pair.getLeft() + "=" + pair.getRight()) - .collect(Collectors.joining("&")))); + .map(pair -> pair.getLeft() + "=" + pair.getRight()) + .collect(Collectors.joining("&")))); } @Override diff --git a/java/impl/aws/src/test/java/co/worklytics/psoxy/aws/request/APIGatewayV1ProxyEventRequestAdapterTest.java
b/java/impl/aws/src/test/java/co/worklytics/psoxy/aws/request/APIGatewayV1ProxyEventRequestAdapterTest.java index 67ecee78c6..edfdfce87d 100644 --- a/java/impl/aws/src/test/java/co/worklytics/psoxy/aws/request/APIGatewayV1ProxyEventRequestAdapterTest.java +++ b/java/impl/aws/src/test/java/co/worklytics/psoxy/aws/request/APIGatewayV1ProxyEventRequestAdapterTest.java @@ -3,6 +3,8 @@ import co.worklytics.test.TestUtils; import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent; import com.fasterxml.jackson.databind.ObjectMapper; +import java.util.LinkedHashMap; +import java.util.List; import java.util.Map; import lombok.SneakyThrows; import org.junit.jupiter.api.Test; @@ -46,13 +48,68 @@ public void parse_interesting() { assertEquals("/something", requestAdapter.getPath()); assertTrue(requestAdapter.getQuery().isPresent()); - - assertEquals("name=John", requestAdapter.getQuery().get()); + assertEquals("name=John", requestAdapter.getQuery().orElseThrow()); assertFalse(requestAdapter.isHttps().isPresent()); } + @Test + public void getQuery_doesNotDuplicateWhenBothQueryMapsPopulated() { + Map<String, String> queryStringParameters = new LinkedHashMap<>(); + queryStringParameters.put("$select", "id,mail,otherMails"); + queryStringParameters.put("$top", "50"); + + Map<String, List<String>> multiValueQueryStringParameters = new LinkedHashMap<>(); + multiValueQueryStringParameters.put("$select", List.of("id,mail,otherMails")); + multiValueQueryStringParameters.put("$top", List.of("50")); + + APIGatewayProxyRequestEvent apiGatewayEvent = new APIGatewayProxyRequestEvent() + .withQueryStringParameters(queryStringParameters) + .withMultiValueQueryStringParameters(multiValueQueryStringParameters); + + APIGatewayV1ProxyEventRequestAdapter requestAdapter = + APIGatewayV1ProxyEventRequestAdapter.of(apiGatewayEvent); + + assertTrue(requestAdapter.getQuery().isPresent()); + assertEquals("$select=id,mail,otherMails&$top=50", requestAdapter.getQuery().orElseThrow()); + } + + @Test + public void
getQuery_preservesRepeatedKeysFromMultiValueMap() { + Map<String, String> queryStringParameters = new LinkedHashMap<>(); + queryStringParameters.put("$select", "id,mail,otherMails"); + queryStringParameters.put("$top", "50"); + + Map<String, List<String>> multiValueQueryStringParameters = new LinkedHashMap<>(); + multiValueQueryStringParameters.put("$select", List.of("id", "mail", "otherMails")); + multiValueQueryStringParameters.put("$top", List.of("50")); + + APIGatewayProxyRequestEvent apiGatewayEvent = new APIGatewayProxyRequestEvent() + .withQueryStringParameters(queryStringParameters) + .withMultiValueQueryStringParameters(multiValueQueryStringParameters); + + APIGatewayV1ProxyEventRequestAdapter requestAdapter = + APIGatewayV1ProxyEventRequestAdapter.of(apiGatewayEvent); + + assertTrue(requestAdapter.getQuery().isPresent()); + assertEquals("$select=id&$select=mail&$select=otherMails&$top=50", + requestAdapter.getQuery().orElseThrow()); + } + + @SneakyThrows + @Test + public void getQuery_fallsBackToSingleValueMapWhenMultiValueAbsent() { + APIGatewayProxyRequestEvent apiGatewayEvent = objectMapper.readerFor(APIGatewayProxyRequestEvent.class) + .readValue(TestUtils.getData("lambda-proxy-events/api-gateway-v1-example_interesting.json")); + + APIGatewayV1ProxyEventRequestAdapter requestAdapter = + APIGatewayV1ProxyEventRequestAdapter.of(apiGatewayEvent); + + assertTrue(requestAdapter.getQuery().isPresent()); + assertEquals("name=John", requestAdapter.getQuery().orElseThrow()); + } + @SneakyThrows @Test public void parse_payload1_from_api_gateway_v2() { diff --git a/java/impl/gcp/src/main/java/co/worklytics/psoxy/GCSOutput.java b/java/impl/gcp/src/main/java/co/worklytics/psoxy/GCSOutput.java index 8943e38041..5c2965a0ba 100644 --- a/java/impl/gcp/src/main/java/co/worklytics/psoxy/GCSOutput.java +++ b/java/impl/gcp/src/main/java/co/worklytics/psoxy/GCSOutput.java @@ -42,8 +42,10 @@ public GCSOutput(@Assisted OutputLocation location) { @Override public void write(String key, ProcessedContent content)
throws WriteFailure { + byte[] body = content.getContent(); + if (key == null) { - key = DigestUtils.md5Hex(content.getContent()); + key = DigestUtils.md5Hex(body); } try { @@ -58,7 +60,9 @@ public void write(String key, ProcessedContent content) throws WriteFailure { .setContentEncoding(content.getContentEncoding()) .setMetadata(metadata) .build())) { - writeChannel.write(java.nio.ByteBuffer.wrap(content.getContent(), 0, content.getContent().length)); + if (body.length > 0) { + writeChannel.write(java.nio.ByteBuffer.wrap(body)); + } } } catch (Exception e) { log.log(Level.WARNING, "Failed to write to GCS sideOutput", e); diff --git a/java/pom.xml b/java/pom.xml index 987ed9ef38..de6f32e925 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -10,7 +10,7 @@ pom - 0.6.6 + 0.6.7 UTF-8 1.18.42 2.40.5 diff --git a/lychee.toml b/lychee.toml new file mode 100644 index 0000000000..25f121e4ef --- /dev/null +++ b/lychee.toml @@ -0,0 +1,27 @@ +# Link checker configuration for docs/ +# Validates fully-qualified http(s) URLs found in docs/. Relative links are not checked +# (no base_url is set, and non-http(s) links are excluded via workflow args). + +verbose = "info" +no_progress = true + +# Do not extract links from fenced code blocks or inline code. +include_verbatim = false + +scheme = ["https", "http"] + +accept = ["200", "204", "302", "403", "429"] + +max_redirects = 20 +timeout = 20 + +# Skip localhost URLs (see also --exclude-loopback in the workflow). +exclude_loopback = true + +# Skip relative markdown links. 
+exclude = [ + '^/', + '^\\./', + '^\\.\\./', + '^mailto:', +] diff --git a/tools/init-tfvars.sh b/tools/init-tfvars.sh index 2d85ad7834..685d97aa11 100755 --- a/tools/init-tfvars.sh +++ b/tools/init-tfvars.sh @@ -7,7 +7,7 @@ PSOXY_BASE_DIR=$2 DEPLOYMENT_ENV=${3:-"local"} HOST_PLATFORM=${4:-"aws"} -SCRIPT_VERSION="v0.6.6" +SCRIPT_VERSION="v0.6.7" if [ -z "$PSOXY_BASE_DIR" ]; then printf "Usage: init-tfvars.sh [DEPLOYMENT_ENV]\n" diff --git a/tools/lib/deployment-bundle.sh b/tools/lib/deployment-bundle.sh index 9276a41687..e69162b99a 100644 --- a/tools/lib/deployment-bundle.sh +++ b/tools/lib/deployment-bundle.sh @@ -159,15 +159,60 @@ deployment_bundle_public_path() { esac } +deployment_bundle_s3_parts() { + local bundle_path="$1" + + if [[ ! "$bundle_path" =~ ^s3://([^/]+)/(.+)$ ]]; then + return 1 + fi + + DEPLOYMENT_BUNDLE_S3_BUCKET="${BASH_REMATCH[1]}" + DEPLOYMENT_BUNDLE_S3_KEY="${BASH_REMATCH[2]}" + DEPLOYMENT_BUNDLE_S3_REGION="us-east-1" + if [[ "$DEPLOYMENT_BUNDLE_S3_BUCKET" =~ ^psoxy-public-artifacts-(.+)$ ]]; then + DEPLOYMENT_BUNDLE_S3_REGION="${BASH_REMATCH[1]}" + fi +} + +deployment_bundle_s3_to_http_url() { + local bundle_path="$1" + local bucket key region + + if ! deployment_bundle_s3_parts "$bundle_path"; then + return 1 + fi + bucket="$DEPLOYMENT_BUNDLE_S3_BUCKET" + key="$DEPLOYMENT_BUNDLE_S3_KEY" + region="$DEPLOYMENT_BUNDLE_S3_REGION" + + printf 'https://%s.s3.%s.amazonaws.com/%s' "$bucket" "$region" "$key" +} + deployment_bundle_public_exists() { local bundle_path="$1" case "$bundle_path" in s3://*) - if ! 
command -v aws >/dev/null 2>&1; then - return 1 + if command -v aws >/dev/null 2>&1 && deployment_bundle_s3_parts "$bundle_path"; then + # head-object only needs object-level access; s3 ls requires s3:ListBucket on the bucket + if aws s3api head-object \ + --bucket "$DEPLOYMENT_BUNDLE_S3_BUCKET" \ + --key "$DEPLOYMENT_BUNDLE_S3_KEY" \ + --region "$DEPLOYMENT_BUNDLE_S3_REGION" \ + >/dev/null 2>&1; then + return 0 + fi + fi + if command -v curl >/dev/null 2>&1; then + local http_url="" http_code="" + if http_url="$(deployment_bundle_s3_to_http_url "$bundle_path")"; then + # follow redirects, then require a final 2xx (3xx alone is not success) + http_code="$(curl -sSIL --max-redirs 5 -o /dev/null -w "%{http_code}" "$http_url" 2>/dev/null)" || return 1 + [[ "$http_code" =~ ^2[0-9]{2}$ ]] + return $? + fi fi - aws s3 ls "$bundle_path" >/dev/null 2>&1 + return 1 ;; gs://*) if command -v gsutil >/dev/null 2>&1; then diff --git a/tools/psoxy-test/data-sources/spec.js b/tools/psoxy-test/data-sources/spec.js index 4ff7c0b2c0..ebdafbe836 100644 --- a/tools/psoxy-test/data-sources/spec.js +++ b/tools/psoxy-test/data-sources/spec.js @@ -358,6 +358,7 @@ export default { { name: 'Users', path: '/v1.0/users', + params: { $select: 'id,mail,otherMails' }, refs: [ { name: 'Events', @@ -393,6 +394,7 @@ export default { { name: 'Users', path: '/beta/users', + params: { $select: 'id,mail,otherMails' }, refs: [ { name: 'Mailbox Settings', diff --git a/tools/psoxy-test/lib/aws.js b/tools/psoxy-test/lib/aws.js index 7587748350..01625336fe 100644 --- a/tools/psoxy-test/lib/aws.js +++ b/tools/psoxy-test/lib/aws.js @@ -274,6 +274,7 @@ async function upload(bucket, key, file, options, client) { const ext = baseName.slice(baseName.lastIndexOf('.')).toLowerCase(); const MIME_TYPES = { '.ndjson': 'application/x-ndjson', + '.jsonl': 'application/jsonlines', '.json': 'application/json', '.csv': 'text/csv', '.tsv': 'text/tab-separated-values', diff --git a/tools/psoxy-test/lib/gcp.js 
b/tools/psoxy-test/lib/gcp.js index 92e519ba5a..7db6639aaf 100644 --- a/tools/psoxy-test/lib/gcp.js +++ b/tools/psoxy-test/lib/gcp.js @@ -320,6 +320,7 @@ async function upload(bucketName, filePath, client, filename) { const ext = baseName.slice(baseName.lastIndexOf('.')).toLowerCase(); const MIME_TYPES = { '.ndjson': 'application/x-ndjson', + '.jsonl': 'application/jsonlines', '.json': 'application/json', '.csv': 'text/csv', '.tsv': 'text/tab-separated-values', diff --git a/tools/psoxy-test/lib/utils.js b/tools/psoxy-test/lib/utils.js index 7e4d29f3ee..d10f50b5af 100644 --- a/tools/psoxy-test/lib/utils.js +++ b/tools/psoxy-test/lib/utils.js @@ -570,11 +570,11 @@ async function signJwtWithGCPKMS(claims, keyId) { /** * Known compound extensions where the "real" extension precedes a compression - * extension (e.g. `.ndjson.gz`, `.csv.gz`). We treat the entire compound + * extension (e.g. `.ndjson.gz`, `.jsonl.gz`, `.csv.gz`). We treat the entire compound * extension as one unit so the suffix is inserted before it: * `events0.ndjson.gz` + timestamp → `events0-.ndjson.gz` */ -const COMPOUND_EXTENSIONS = ['.ndjson.gz', '.csv.gz', '.json.gz', '.tsv.gz', '.parquet.gz']; +const COMPOUND_EXTENSIONS = ['.ndjson.gz', '.jsonl.gz', '.csv.gz', '.json.gz', '.tsv.gz', '.parquet.gz']; /** * Append suffix to filename (before extension) diff --git a/tools/psoxy-test/test/utils.test.js b/tools/psoxy-test/test/utils.test.js index 34253b46fc..a602c2b201 100644 --- a/tools/psoxy-test/test/utils.test.js +++ b/tools/psoxy-test/test/utils.test.js @@ -123,6 +123,8 @@ test('Add filename suffix', (t) => { // Compound extensions: suffix goes before the full compound extension t.is(addFilenameSuffix('events0.ndjson.gz', 1775019339023), 'events0-1775019339023.ndjson.gz'); + t.is(addFilenameSuffix('events0.jsonl.gz', 1775019339023), + 'events0-1775019339023.jsonl.gz'); t.is(addFilenameSuffix('data.csv.gz', 'bar'), 'data-bar.csv.gz'); t.is(addFilenameSuffix('folder/test/data.ndjson.gz', 
1701711533220), 'data-1701711533220.ndjson.gz'); diff --git a/tools/release/qa/apply-example.sh b/tools/release/qa/apply-example.sh new file mode 100755 index 0000000000..a5567d752c --- /dev/null +++ b/tools/release/qa/apply-example.sh @@ -0,0 +1,55 @@ +#!/bin/bash +# Non-interactive terraform apply for a dev example, saving plan + apply logs. +# Usage: ./tools/release/qa/apply-example.sh [force_bundle] +# Example: ./tools/release/qa/apply-example.sh aws v0.6.6 true + +set -euo pipefail + +COLORSCHEME_SH="$(dirname "$0")/../../set-term-colorscheme.sh" +if [ -f "$COLORSCHEME_SH" ]; then + # shellcheck source=/dev/null + source "$COLORSCHEME_SH" +else + ERR='\033[0;31m'; SUCCESS='\033[0;32m'; WARN='\033[1;33m'; INFO='\033[0;34m'; NC='\033[0m' +fi + +EXAMPLE="${1:-}" +RELEASE="${2:-}" +FORCE_BUNDLE="${3:-true}" + +if [ -z "$EXAMPLE" ] || [ -z "$RELEASE" ]; then + printf "${ERR}Usage: %s [force_bundle]${NC}\n" "$0" + exit 1 +fi + +if [ "$EXAMPLE" != "aws" ] && [ "$EXAMPLE" != "gcp" ]; then + printf "${ERR}Example must be 'aws' or 'gcp'.${NC}\n" + exit 1 +fi + +ROOT="$(git rev-parse --show-toplevel)" +EXAMPLE_DIR="${ROOT}/infra/examples-dev/${EXAMPLE}" +DATE_STAMP="$(date +%Y%m%d)" +PLAN_LOG="${EXAMPLE_DIR}/${DATE_STAMP}_${EXAMPLE}-${RELEASE}-plan.txt" +APPLY_LOG="${EXAMPLE_DIR}/${DATE_STAMP}_${EXAMPLE}-${RELEASE}-apply.txt" + +if [ ! 
-d "$EXAMPLE_DIR" ]; then + printf "${ERR}Example directory not found: %s${NC}\n" "$EXAMPLE_DIR" + exit 1 +fi + +cd "$EXAMPLE_DIR" + +printf "Running ${INFO}terraform plan${NC} for ${INFO}%s${NC} (release %s) ...\n" "$EXAMPLE" "$RELEASE" +printf "Plan log: ${INFO}%s${NC}\n" "$PLAN_LOG" + +terraform plan -var="force_bundle=${FORCE_BUNDLE}" -no-color 2>&1 | tee "$PLAN_LOG" + +printf "\n${WARN}Review the plan log above before continuing.${NC}\n" +printf "Applying ${INFO}%s${NC} with force_bundle=%s ...\n" "$EXAMPLE" "$FORCE_BUNDLE" +printf "Apply log: ${INFO}%s${NC}\n" "$APPLY_LOG" + +terraform apply -auto-approve -var="force_bundle=${FORCE_BUNDLE}" -no-color 2>&1 | tee "$APPLY_LOG" + +printf "\n${SUCCESS}Apply completed for %s.${NC}\n" "$EXAMPLE" +printf "Logs:\n plan: %s\n apply: %s\n" "$PLAN_LOG" "$APPLY_LOG" diff --git a/tools/release/qa/run-example-tests.sh b/tools/release/qa/run-example-tests.sh new file mode 100755 index 0000000000..cee74ddbbf --- /dev/null +++ b/tools/release/qa/run-example-tests.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# Run test-all.sh for a dev example and capture output. +# Usage: ./tools/release/qa/run-example-tests.sh +# Example: ./tools/release/qa/run-example-tests.sh aws v0.6.6 + +set -euo pipefail + +COLORSCHEME_SH="$(dirname "$0")/../../set-term-colorscheme.sh" +if [ -f "$COLORSCHEME_SH" ]; then + # shellcheck source=/dev/null + source "$COLORSCHEME_SH" +else + ERR='\033[0;31m'; SUCCESS='\033[0;32m'; WARN='\033[1;33m'; INFO='\033[0;34m'; NC='\033[0m' +fi + +EXAMPLE="${1:-}" +RELEASE="${2:-}" + +if [ -z "$EXAMPLE" ] || [ -z "$RELEASE" ]; then + printf "${ERR}Usage: %s ${NC}\n" "$0" + exit 1 +fi + +if [ "$EXAMPLE" != "aws" ] && [ "$EXAMPLE" != "gcp" ]; then + printf "${ERR}Example must be 'aws' or 'gcp'.${NC}\n" + exit 1 +fi + +ROOT="$(git rev-parse --show-toplevel)" +EXAMPLE_DIR="${ROOT}/infra/examples-dev/${EXAMPLE}" +DATE_STAMP="$(date +%Y%m%d)" +OUTPUT_FILE="${EXAMPLE_DIR}/${DATE_STAMP}_${EXAMPLE}-${RELEASE}-tests.txt" + +if [ ! 
-f "${EXAMPLE_DIR}/test-all.sh" ]; then + printf "${ERR}test-all.sh not found in %s${NC}\n" "$EXAMPLE_DIR" + exit 1 +fi + +cd "$EXAMPLE_DIR" +printf "Running ${INFO}./test-all.sh${NC} for ${INFO}%s${NC} ...\n" "$EXAMPLE" +printf "Output: ${INFO}%s${NC}\n" "$OUTPUT_FILE" + +./test-all.sh 2>&1 | tee "$OUTPUT_FILE" + +printf "\n${SUCCESS}Tests completed for %s.${NC}\n" "$EXAMPLE" +printf "Output: %s\n" "$OUTPUT_FILE" diff --git a/tools/release/qa/summarize-connector-tests.sh b/tools/release/qa/summarize-connector-tests.sh new file mode 100755 index 0000000000..c508299de3 --- /dev/null +++ b/tools/release/qa/summarize-connector-tests.sh @@ -0,0 +1,181 @@ +#!/bin/bash +# Summarize connector test output from test-all.sh into markdown + checklist metadata. +# Usage: ./tools/release/qa/summarize-connector-tests.sh [release] +# Writes: .summary.md and .checklist +# Prints the markdown summary to stdout. + +set -euo pipefail + +CLOUD="${1:-}" +INPUT="${2:-}" +RELEASE="${3:-}" + +if [ -z "$CLOUD" ] || [ -z "$INPUT" ]; then + echo "Usage: $0 [release]" >&2 + exit 1 +fi + +if [ ! -f "$INPUT" ]; then + echo "File not found: $INPUT" >&2 + exit 1 +fi + +SUMMARY_FILE="${INPUT}.summary.md" +CHECKLIST_FILE="${INPUT}.checklist" + +# Strip ANSI color codes for parsing. 
+CLEAN_INPUT="$(mktemp)" +trap 'rm -f "$CLEAN_INPUT"' EXIT +sed -E 's/\x1B\[[0-9;]*[[:alpha:]]//g' "$INPUT" > "$CLEAN_INPUT" + +python3 - "$CLOUD" "$CLEAN_INPUT" "$RELEASE" "$SUMMARY_FILE" "$CHECKLIST_FILE" <<'PY' +import re +import sys +from pathlib import Path + +cloud, input_path, release, summary_path, checklist_path = sys.argv[1:6] +text = Path(input_path).read_text(errors="replace") + +MSFT = {"azure-ad", "outlook-cal", "msft-teams"} +GOOGLE = {"gcal", "gdirectory", "google-chat", "gmail", "gemini-in-workspace-apps"} +TOKEN = { + "asana", "slack-analytics", "chatgpt-enterprise", "cursor", "zoom", + "jira-cloud", "github", "github-copilot", +} +ASYNC = {"slack-analytics"} +WEBHOOK = {"llm-portal"} +BULK = {"hris", "metrics", "workdata-generic"} + +CATEGORY_LABELS = { + "microsoft": "Microsoft API connector", + "google_workspace": "Google Workspace API connector", + "token": "Token-based API connector", + "async": "API connector with async", + "webhook": "Webhook collector", + "bulk": "Bulk connector", +} + + +def normalize_name(raw: str) -> str: + raw = re.sub(r"\[[0-9;]*m", "", raw).strip() + raw = re.sub(r"\s+\.\.\.$", "", raw).strip() + for prefix in ("dev-erik-awsall-", "psoxy-dev-erik-"): + if raw.startswith(prefix): + raw = raw[len(prefix):] + return raw + + +def connector_status(block: str, name: str) -> str: + lower = block.lower() + if name in BULK: + if "file downloaded" in lower and "file uploaded" in lower: + return "pass" + return "fail" + if name in WEBHOOK: + if "verification successful" in lower: + return "pass" + return "fail" + + health_ok = bool(re.search(r"health check result:\s*ok", lower)) + health_fail = bool(re.search(r"health check result:\s*precondition failed", lower)) + call_ok = bool(re.search(r"call result:\s*ok", lower)) + call_fail = bool(re.search(r"call result:.*(error|failed)", lower)) + missing = re.findall(r'"missingConfigProperties":\s*\[\s*"([^"]+)"', block) + + if name in ASYNC and "async response content" in lower and 
health_ok and call_ok: + return "pass" + + if health_ok and call_ok: + return "pass" + if health_ok and call_fail: + return "partial" + if health_fail or missing: + return "fail" + if call_fail: + return "fail" + return "unknown" + + +parts = re.split(r"(?=Quick test of )", text) +connectors = [] +for part in parts: + m = re.match(r"Quick test of (.+)", part) + if not m: + continue + name = normalize_name(m.group(1)) + status = connector_status(part, name) + connectors.append((name, status)) + +if not connectors: + print(f"No connector tests found in {input_path}", file=sys.stderr) + sys.exit(1) + +status_icon = {"pass": "✅", "partial": "⚠️", "fail": "❌", "unknown": "❓"} +counts = {"pass": 0, "partial": 0, "fail": 0, "unknown": 0} +for _, status in connectors: + counts[status] = counts.get(status, 0) + 1 + + +def category_status(category_ids: set) -> str: + tested = [(n, s) for n, s in connectors if n in category_ids] + if not tested: + return "skip" + if any(s == "pass" for _, s in tested): + return "pass" + if any(s == "partial" for _, s in tested): + return "partial" + return "fail" + + +categories = {key: category_status(ids) for key, ids in { + "microsoft": MSFT, + "google_workspace": GOOGLE, + "token": TOKEN, + "async": ASYNC, + "webhook": WEBHOOK, + "bulk": BULK, +}.items()} + +release_line = f"**Release:** `{release}` \n" if release else "" +lines = [ + f"### {cloud.upper()} connector QA", + "", + release_line.rstrip(), + "", + ( + f"**Summary:** {counts['pass']} passing, {counts['partial']} partial, " + f"{counts['fail']} failing" + + (f", {counts['unknown']} unknown" if counts['unknown'] else "") + + f" (of {len(connectors)} tested)" + ), + "", + "| Connector | Status |", + "|-----------|--------|", +] +for name, status in connectors: + note = "" + if status == "partial": + note = " (health OK, API issue)" + elif status == "fail": + note = " (not configured or setup error)" + lines.append(f"| **{name}** | {status_icon.get(status, '❓')} {status}{note} |") 
+ +lines.extend(["", "#### Test plan categories", ""]) +for key, label in CATEGORY_LABELS.items(): + cat = categories[key] + if cat == "skip": + lines.append(f"- ⏭️ {label} (not tested in this example)") + elif cat == "pass": + lines.append(f"- ✅ {label}") + elif cat == "partial": + lines.append(f"- ⚠️ {label} (partial)") + else: + lines.append(f"- ❌ {label}") + +summary = "\n".join(lines) + "\n" +Path(summary_path).write_text(summary) +Path(checklist_path).write_text( + "\n".join(f"{cloud} {key} {categories[key]}" for key in CATEGORY_LABELS) + "\n" +) +print(summary, end="") +PY diff --git a/tools/release/qa/update-release-pr-results.sh b/tools/release/qa/update-release-pr-results.sh new file mode 100755 index 0000000000..59a91ee50b --- /dev/null +++ b/tools/release/qa/update-release-pr-results.sh @@ -0,0 +1,117 @@ +#!/bin/bash +# Post connector QA summaries to a release PR and check off test-plan items. +# Usage: ./tools/release/qa/update-release-pr-results.sh <pr-number> <aws-checklist> <gcp-checklist> [aws-summary-md] [gcp-summary-md] +# Checklist files are produced by summarize-connector-tests.sh (*.checklist). + +set -euo pipefail + +COLORSCHEME_SH="$(dirname "$0")/../../set-term-colorscheme.sh" +if [ -f "$COLORSCHEME_SH" ]; then + # shellcheck source=/dev/null + source "$COLORSCHEME_SH" +else + ERR='\033[0;31m'; SUCCESS='\033[0;32m'; WARN='\033[1;33m'; INFO='\033[0;34m'; NC='\033[0m' +fi + +PR_NUMBER="${1:-}" +AWS_CHECKLIST="${2:-}" +GCP_CHECKLIST="${3:-}" +AWS_SUMMARY="${4:-}" +GCP_SUMMARY="${5:-}" + +if [ -z "$PR_NUMBER" ] || [ -z "$AWS_CHECKLIST" ] || [ -z "$GCP_CHECKLIST" ]; then + printf "${ERR}Usage: %s <pr-number> <aws-checklist> <gcp-checklist> [aws-summary-md] [gcp-summary-md]${NC}\n" "$0" + exit 1 +fi + +for f in "$AWS_CHECKLIST" "$GCP_CHECKLIST"; do + if [ !
-f "$f" ]; then + printf "${ERR}Checklist file not found: %s${NC}\n" "$f" + exit 1 + fi +done + +COMMENT_FILE="$(mktemp)" +BODY_FILE="$(mktemp)" +trap 'rm -f "$COMMENT_FILE" "$BODY_FILE"' EXIT + +{ + echo "## Connector QA results (dev examples)" + echo "" + if [ -n "$AWS_SUMMARY" ] && [ -f "$AWS_SUMMARY" ]; then + cat "$AWS_SUMMARY" + echo "" + fi + if [ -n "$GCP_SUMMARY" ] && [ -f "$GCP_SUMMARY" ]; then + cat "$GCP_SUMMARY" + echo "" + fi + echo "_Generated by \`tools/release/qa/update-release-pr-results.sh\`_" +} > "$COMMENT_FILE" + +gh pr comment "$PR_NUMBER" --body-file "$COMMENT_FILE" +printf "${SUCCESS}Posted QA summary comment on PR #%s.${NC}\n" "$PR_NUMBER" + +CURRENT_BODY="$(gh pr view "$PR_NUMBER" --json body -q .body)" +printf "%s" "$CURRENT_BODY" > "$BODY_FILE" + +python3 - "$BODY_FILE" "$AWS_CHECKLIST" "$GCP_CHECKLIST" <<'PY' +import re +import sys +from pathlib import Path + +body_path, aws_path, gcp_path = sys.argv[1:4] +body = Path(body_path).read_text() +aws = {} +gcp = {} +for path, target in ((aws_path, aws), (gcp_path, gcp)): + for line in Path(path).read_text().splitlines(): + cloud, key, status = line.split() + target[key] = status + +CATEGORY_LABELS = { + "microsoft": "Microsoft API connector", + "google_workspace": "Google Workspace API connector", + "token": "Token-based API connector", + "async": "API connector with async", + "webhook": "Webhook collector", + "bulk": "Bulk connector", +} + + +def checkbox(status: str) -> str: + return "[x]" if status in ("pass", "partial") else "[ ]" + + +def update_section(section_name: str, statuses: dict, text: str) -> str: + pattern = rf"(### {re.escape(section_name)}\s.*?Confirm everything worked:\s*)(.*?)(?=\n### |\Z)" + m = re.search(pattern, text, flags=re.S) + if not m: + return text + header, block = m.group(1), m.group(2) + new_block = block + for key, label in CATEGORY_LABELS.items(): + status = statuses.get(key, "skip") + if status == "skip": + continue + checked = checkbox(status) + new_block 
= re.sub( + rf"- \[ \] {re.escape(label)}", + f"- {checked} {label}", + new_block, + ) + new_block = re.sub( + rf"- \[x\] {re.escape(label)}", + f"- {checked} {label}", + new_block, + ) + return text[: m.start()] + header + new_block + text[m.end() :] + + +body = update_section("AWS", aws, body) +body = update_section("GCP", gcp, body) +Path(body_path).write_text(body) +PY + +gh pr edit "$PR_NUMBER" --body-file "$BODY_FILE" +printf "${SUCCESS}Updated PR #%s description test-plan checkboxes.${NC}\n" "$PR_NUMBER" diff --git a/tools/release/qa/verify-release-refs.sh b/tools/release/qa/verify-release-refs.sh new file mode 100755 index 0000000000..c8229b3558 --- /dev/null +++ b/tools/release/qa/verify-release-refs.sh @@ -0,0 +1,84 @@ +#!/bin/bash +# Verify release refs were updated from rc-vX.Y.Z to vX.Y.Z before release QA. +# Usage: ./tools/release/qa/verify-release-refs.sh <vX.Y.Z> +# Example: ./tools/release/qa/verify-release-refs.sh v0.6.6 + +set -euo pipefail + +COLORSCHEME_SH="$(dirname "$0")/../../set-term-colorscheme.sh" +if [ -f "$COLORSCHEME_SH" ]; then + # shellcheck source=/dev/null + source "$COLORSCHEME_SH" +else + ERR='\033[0;31m'; SUCCESS='\033[0;32m'; WARN='\033[1;33m'; INFO='\033[0;34m'; NC='\033[0m' +fi + +RELEASE="${1:-}" +if [ -z "$RELEASE" ]; then + printf "${ERR}Usage: %s <vX.Y.Z>${NC}\n" "$0" + printf "Example: %s v0.6.6\n" "$0" + exit 1 +fi + +if [[ !
"$RELEASE" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then + printf "${ERR}Release must look like v0.6.6 (got: %s)${NC}\n" "$RELEASE" + exit 1 +fi + +RC_RELEASE="rc-${RELEASE}" +RELEASE_NUMBER="${RELEASE#v}" +ROOT="$(git rev-parse --show-toplevel 2>/dev/null || pwd)" +cd "$ROOT" + +FAIL=0 + +printf "Verifying release refs for ${INFO}%s${NC} (from ${INFO}%s${NC}) ...\n" "$RELEASE" "$RC_RELEASE" + +CURRENT_BRANCH="$(git branch --show-current)" +if [ "$CURRENT_BRANCH" != "$RC_RELEASE" ] && [ "$CURRENT_BRANCH" != "${RELEASE#v}" ]; then + printf "${WARN}Warning: current branch is '%s'; expected '%s' or a release prep branch.${NC}\n" \ + "$CURRENT_BRANCH" "$RC_RELEASE" +fi + +POM_REVISION="$(sed -n 's:.*<revision>\([^<]*\)</revision>.*:\1:p' java/pom.xml | head -1)" +if [ "$POM_REVISION" != "$RELEASE_NUMBER" ]; then + printf "${ERR}java/pom.xml revision is '%s'; expected '%s'.${NC}\n" "$POM_REVISION" "$RELEASE_NUMBER" + FAIL=1 +else + printf "${SUCCESS}java/pom.xml revision matches %s.${NC}\n" "$RELEASE_NUMBER" +fi + +STALE_RC_REFS="$(git grep -n "ref=${RC_RELEASE}" -- 'infra/' 'java/' 'tools/' 2>/dev/null || true)" +if [ -n "$STALE_RC_REFS" ]; then + printf "${ERR}Found stale module refs to %s:${NC}\n%s\n" "$RC_RELEASE" "$STALE_RC_REFS" + FAIL=1 +else + printf "${SUCCESS}No stale ref=%s module references under infra/, java/, or tools/.${NC}\n" "$RC_RELEASE" +fi + +MISSING_RELEASE_REFS="$( (git grep -n "ref=${RELEASE}" -- 'infra/examples-dev/' 2>/dev/null || true) | wc -l | tr -d ' ' )" +if [ "$MISSING_RELEASE_REFS" -eq 0 ]; then + printf "${WARN}No commented ref=%s lines found in infra/examples-dev/ (expected in example .tf files).${NC}\n" "$RELEASE" +else + printf "${SUCCESS}Found %s commented ref=%s references in examples-dev.${NC}\n" "$MISSING_RELEASE_REFS" "$RELEASE" +fi + +STALE_RC_STRINGS="$(git grep -n "${RC_RELEASE}" -- 'infra/' 'java/' 'tools/' 2>/dev/null \ + | grep -v 'verify-release-refs.sh' \ + | grep -v 'prep.sh' \ + | grep -v 'releases.md' \ + | grep -v 
'upgrade-terraform-modules.sh' \
+ | grep -v 'HealthCheckResultTest.java' \ + || true)" +if [ -n "$STALE_RC_STRINGS" ]; then + printf "${WARN}Other %s string references remain (review; may be OK in docs/tests):${NC}\n%s\n" \ + "$RC_RELEASE" "$STALE_RC_STRINGS" +fi + +if [ "$FAIL" -ne 0 ]; then + printf "\n${ERR}Release ref verification failed.${NC}\n" + printf "Run from repo root: ${INFO}./tools/release/prep.sh %s %s${NC}\n" "$RC_RELEASE" "$RELEASE" + exit 1 +fi + +printf "\n${SUCCESS}Release ref verification passed for %s.${NC}\n" "$RELEASE" diff --git a/tools/release/release-qa.md b/tools/release/release-qa.md new file mode 100644 index 0000000000..7a26e8e3ab --- /dev/null +++ b/tools/release/release-qa.md @@ -0,0 +1,163 @@ +# Release QA + +End-to-end QA before merging an `rc-vX.Y.Z` branch to `main`. Run from the repository root. + +This supplements [releases.md](../../docs/development/releases.md) and automates the dev-example apply/test workflow described in [test_plan.md](test_plan.md). + +## Prerequisites + +- Branch `rc-vX.Y.Z` with release refs updated to `vX.Y.Z` (via `./tools/release/prep.sh`) +- CLI auth: `aws`, `gcloud` (+ application-default credentials), and `az` when `msft_tenant_id` is set in tfvars +- `gh` authenticated (for PR steps) +- `terraform` in PATH + +## Quick start + +```shell +./tools/release/run-release-qa.sh v0.6.6 +``` + +Runs verify → apply (AWS, then GCP) → test-all (both) → summarize. Stops before creating the +release PR so you can review plan logs and connector summaries. + +To also open the release PR and post results: + +```shell +./tools/release/run-release-qa.sh v0.6.6 --create-pr --post-pr-results +``` + +## Workflow + +Run steps **sequentially**. Do not apply AWS and GCP in parallel. 
+ +| Step | Action | Script | +|------|--------|--------| +| 1 | Verify release refs | `qa/verify-release-refs.sh` | +| 2 | Apply AWS example (review plan log) | `qa/apply-example.sh aws` | +| 3 | Apply GCP example (review plan log) | `qa/apply-example.sh gcp` | +| 4 | Run AWS connector tests | `qa/run-example-tests.sh aws` | +| 5 | Run GCP connector tests | `qa/run-example-tests.sh gcp` | +| 6 | Summarize connector state | `qa/summarize-connector-tests.sh` | +| 7 | Create release PR | `rc-to-main.sh` | +| 8 | Comment on PR + check off test plan | `qa/update-release-pr-results.sh` | + +### Step 1: Verify release refs + +If refs are not updated yet: + +```shell +./tools/release/prep.sh rc-vX.Y.Z vX.Y.Z +``` + +Then: + +```shell +./tools/release/qa/verify-release-refs.sh vX.Y.Z +``` + +### Steps 2–3: Apply dev examples + +```shell +./tools/release/qa/apply-example.sh aws vX.Y.Z true +./tools/release/qa/apply-example.sh gcp vX.Y.Z true +``` + +Logs: `infra/examples-dev/{aws,gcp}/YYYYMMDD_{aws|gcp}-vX.Y.Z-{plan,apply}.txt` + +Review plan logs for unexpected destroys or replacements before running tests. + +### Steps 4–5: Connector tests + +```shell +./tools/release/qa/run-example-tests.sh aws vX.Y.Z +./tools/release/qa/run-example-tests.sh gcp vX.Y.Z +``` + +Outputs: `infra/examples-dev/{aws,gcp}/YYYYMMDD_{aws|gcp}-vX.Y.Z-tests.txt` + +Allow several minutes per cloud (Slack async, bulk uploads, llm-portal bucket polling). 
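When a test run is reused rather than repeated, the newest matching log has to be picked by hand. The following is a minimal Python sketch of that lookup, assuming the `YYYYMMDD_{aws|gcp}-vX.Y.Z-tests.txt` naming shown above; `latest_test_log` is a hypothetical helper for illustration and is not part of the repo's scripts.

```python
from pathlib import Path
from typing import Optional


def latest_test_log(root: Path, cloud: str, release: str) -> Optional[Path]:
    """Return the most recently modified test log for a cloud and release, or None.

    Hypothetical helper: looks under <root>/infra/examples-dev/<cloud> for files
    named *_<cloud>-<release>-tests.txt and picks the newest by modification time.
    """
    example_dir = root / "infra" / "examples-dev" / cloud
    candidates = sorted(
        example_dir.glob(f"*_{cloud}-{release}-tests.txt"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    return candidates[0] if candidates else None
```

Calling `latest_test_log(Path("."), "aws", "v0.6.6")` from the repository root would return the newest AWS log for that release, or `None` if no run has been captured yet.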
+ +### Step 6: Summarize + +```shell +./tools/release/qa/summarize-connector-tests.sh aws infra/examples-dev/aws/...-tests.txt vX.Y.Z +./tools/release/qa/summarize-connector-tests.sh gcp infra/examples-dev/gcp/...-tests.txt vX.Y.Z +``` + +Each run writes: + +- `*.summary.md` — markdown tables and category breakdown +- `*.checklist` — one status (`pass`, `partial`, `fail`, or `skip`) per test-plan category (for PR checkbox updates) + +#### Result statuses + +| Status | Meaning | +|--------|---------| +| pass | Health + API/bulk/webhook verification succeeded | +| partial | Proxy healthy but upstream API rejected the call | +| fail | Missing secrets/config or connection setup error | + +#### Test-plan categories + +From [test_plan.md](test_plan.md). A category is checked off when at least one connector in that category is `pass` or `partial`. + +| Category | Example connectors | +|----------|-------------------| +| Microsoft API | `azure-ad`, `outlook-cal`, `msft-teams` | +| Google Workspace API | `gcal`, `gdirectory`, `google-chat`, `gmail`, `gemini-in-workspace-apps` | +| Token-based API | `asana`, `slack-analytics`, `zoom`, `jira-cloud`, `github`, … | +| API with async | `slack-analytics` | +| Webhook collector | `llm-portal` | +| Bulk connector | `hris`, `metrics`, `workdata-generic` | + +Distinguish credential gaps (expected for unconfigured connectors) from proxy regressions in summaries. + +### Step 7: Create release PR + +On `rc-vX.Y.Z`: + +```shell +./tools/release/rc-to-main.sh vX.Y.Z +``` + +Partially interactive (`npm audit fix` prompt). Note the PR number from output. + +### Step 8: Post QA on the PR + +```shell +./tools/release/qa/update-release-pr-results.sh \ + <pr-number> \ + infra/examples-dev/aws/...-tests.txt.checklist \ + infra/examples-dev/gcp/...-tests.txt.checklist \ + infra/examples-dev/aws/...-tests.txt.summary.md \ + infra/examples-dev/gcp/...-tests.txt.summary.md +``` + +Posts a comment with both summaries and checks off test-plan items in the PR body.
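The checkbox update can be illustrated with a short Python sketch. It is a simplified stand-in for what `qa/update-release-pr-results.sh` does, not the script's exact code: `set_checkbox` is a hypothetical name, and the sketch ignores the per-cloud `### AWS` / `### GCP` section scoping and the `skip` handling that the real script applies.

```python
import re


def set_checkbox(body: str, label: str, status: str) -> str:
    """Check the markdown task item for `label` on pass/partial; uncheck it otherwise."""
    mark = "[x]" if status in ("pass", "partial") else "[ ]"
    # Match the item whether it is currently unchecked ("[ ]") or checked ("[x]").
    pattern = rf"- \[[ x]\] {re.escape(label)}"
    # A callable replacement avoids any backslash interpretation in the label.
    return re.sub(pattern, lambda _: f"- {mark} {label}", body)
```

For example, `set_checkbox("- [ ] Bulk connector\n", "Bulk connector", "partial")` yields `"- [x] Bulk connector\n"`, matching the rule that `partial` still counts toward checking off a category.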
+ +### After merge + +```shell +./tools/release/publish.sh vX.Y.Z +``` + +## Troubleshooting + +| Issue | Action | +|-------|--------| +| `verify-release-refs.sh` fails | Run `./tools/release/prep.sh rc-vX.Y.Z vX.Y.Z` | +| Apply auth errors | `./az-auth`, `aws sso login`, `gcloud auth application-default login` | +| `missingConfigProperties` in health check | Unconfigured secrets; note in summary, not a proxy bug | +| `msft-teams` 401 while `azure-ad` works | Azure Graph permissions/consent | +| `rc-to-main.sh` branch error | `git checkout rc-vX.Y.Z` | + +## Scripts + +| Script | Purpose | +|--------|---------| +| [run-release-qa.sh](run-release-qa.sh) | Orchestrates the QA workflow | +| [qa/verify-release-refs.sh](qa/verify-release-refs.sh) | Confirm rc → v ref migration | +| [qa/apply-example.sh](qa/apply-example.sh) | Plan + apply with logs | +| [qa/run-example-tests.sh](qa/run-example-tests.sh) | Run `test-all.sh`, capture output | +| [qa/summarize-connector-tests.sh](qa/summarize-connector-tests.sh) | Parse test output → markdown | +| [qa/update-release-pr-results.sh](qa/update-release-pr-results.sh) | PR comment + checkbox update | diff --git a/tools/release/run-release-qa.sh b/tools/release/run-release-qa.sh new file mode 100755 index 0000000000..4fedf1e853 --- /dev/null +++ b/tools/release/run-release-qa.sh @@ -0,0 +1,195 @@ +#!/bin/bash +# Orchestrate release QA: verify refs, apply examples, test connectors, summarize. 
+# +# Usage: +# ./tools/release/run-release-qa.sh <vX.Y.Z> [options] +# +# Options: +# --force-bundle <true|false> Pass to apply-example (default: true) +# --skip-verify Skip release ref verification +# --skip-apply Skip terraform apply (use existing deployments) +# --skip-tests Skip test-all +# --create-pr Run rc-to-main.sh after tests (interactive) +# --post-pr-results Post summaries to PR (requires --pr-number or PR from --create-pr) +# --pr-number <n> PR to update (for --post-pr-results without --create-pr) +# +# Examples: +# ./tools/release/run-release-qa.sh v0.6.6 +# ./tools/release/run-release-qa.sh v0.6.6 --create-pr --post-pr-results + +set -euo pipefail + +COLORSCHEME_SH="$(dirname "$0")/../set-term-colorscheme.sh" +if [ -f "$COLORSCHEME_SH" ]; then + # shellcheck source=/dev/null + source "$COLORSCHEME_SH" +else + ERR='\033[0;31m'; SUCCESS='\033[0;32m'; WARN='\033[1;33m'; INFO='\033[0;34m'; NC='\033[0m' +fi + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" +QA_DIR="${SCRIPT_DIR}/qa" + +RELEASE="" +FORCE_BUNDLE="true" +SKIP_VERIFY=false +SKIP_APPLY=false +SKIP_TESTS=false +CREATE_PR=false +POST_PR_RESULTS=false +PR_NUMBER="" + +usage() { + # Print the header comment (lines 2-18: description, usage, options, examples). + sed -n '2,18p' "$0" | sed 's/^# \{0,1\}//' + exit 1 +} + +while [ $# -gt 0 ]; do + case "$1" in + -h|--help) usage ;; + --force-bundle) + FORCE_BUNDLE="${2:?--force-bundle requires true or false}" + shift 2 + ;; + --skip-verify) SKIP_VERIFY=true; shift ;; + --skip-apply) SKIP_APPLY=true; shift ;; + --skip-tests) SKIP_TESTS=true; shift ;; + --create-pr) CREATE_PR=true; shift ;; + --post-pr-results) POST_PR_RESULTS=true; shift ;; + --pr-number) + PR_NUMBER="${2:?--pr-number requires a value}" + shift 2 + ;; + v[0-9]*.[0-9]*.[0-9]*) + RELEASE="$1" + shift + ;; + *) + printf "${ERR}Unknown argument: %s${NC}\n" "$1" >&2 + usage + ;; + esac +done + +if [ -z "$RELEASE" ]; then + printf "${ERR}Release version required (e.g. v0.6.6).${NC}\n" >&2 + usage +fi + +if [[ !
"$RELEASE" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then + printf "${ERR}Release must look like v0.6.6 (got: %s).${NC}\n" "$RELEASE" >&2 + exit 1 +fi + +RC_BRANCH="rc-${RELEASE}" +DATE_STAMP="$(date +%Y%m%d)" +AWS_TEST_LOG="" +GCP_TEST_LOG="" + +cd "$ROOT" + +printf "${INFO}Release QA for %s${NC}\n" "$RELEASE" +printf "Documentation: tools/release/release-qa.md\n\n" + +if [ "$SKIP_VERIFY" = false ]; then + printf "=== Step 1: Verify release refs ===\n" + "${QA_DIR}/verify-release-refs.sh" "$RELEASE" + printf "\n" +fi + +if [ "$SKIP_APPLY" = false ]; then + printf "=== Steps 2–3: Apply dev examples (sequential) ===\n" + "${QA_DIR}/apply-example.sh" aws "$RELEASE" "$FORCE_BUNDLE" + printf "\n" + "${QA_DIR}/apply-example.sh" gcp "$RELEASE" "$FORCE_BUNDLE" + printf "\n" +fi + +if [ "$SKIP_TESTS" = false ]; then + printf "=== Steps 4–5: Run connector tests ===\n" + "${QA_DIR}/run-example-tests.sh" aws "$RELEASE" + AWS_TEST_LOG="${ROOT}/infra/examples-dev/aws/${DATE_STAMP}_aws-${RELEASE}-tests.txt" + printf "\n" + "${QA_DIR}/run-example-tests.sh" gcp "$RELEASE" + GCP_TEST_LOG="${ROOT}/infra/examples-dev/gcp/${DATE_STAMP}_gcp-${RELEASE}-tests.txt" + printf "\n" +else + AWS_TEST_LOG="$(ls -t "${ROOT}/infra/examples-dev/aws/"*"_aws-${RELEASE}-tests.txt" 2>/dev/null | head -1 || true)" + GCP_TEST_LOG="$(ls -t "${ROOT}/infra/examples-dev/gcp/"*"_gcp-${RELEASE}-tests.txt" 2>/dev/null | head -1 || true)" +fi + +printf "=== Step 6: Summarize connector results ===\n" +if [ -z "$AWS_TEST_LOG" ] || [ ! -f "$AWS_TEST_LOG" ]; then + printf "${ERR}AWS test log not found. Run tests or pass a log via --skip-tests after a prior run.${NC}\n" >&2 + exit 1 +fi +if [ -z "$GCP_TEST_LOG" ] || [ ! 
-f "$GCP_TEST_LOG" ]; then + printf "${ERR}GCP test log not found.${NC}\n" >&2 + exit 1 +fi + +"${QA_DIR}/summarize-connector-tests.sh" aws "$AWS_TEST_LOG" "$RELEASE" +printf "\n" +"${QA_DIR}/summarize-connector-tests.sh" gcp "$GCP_TEST_LOG" "$RELEASE" +printf "\n" + +AWS_CHECKLIST="${AWS_TEST_LOG}.checklist" +GCP_CHECKLIST="${GCP_TEST_LOG}.checklist" +AWS_SUMMARY="${AWS_TEST_LOG}.summary.md" +GCP_SUMMARY="${GCP_TEST_LOG}.summary.md" + +if [ "$CREATE_PR" = true ]; then + printf "=== Step 7: Create release PR ===\n" + CURRENT_BRANCH="$(git branch --show-current)" + if [ "$CURRENT_BRANCH" != "$RC_BRANCH" ]; then + printf "${WARN}Checking out %s ...${NC}\n" "$RC_BRANCH" + git checkout "$RC_BRANCH" + fi + PR_OUTPUT="$("${SCRIPT_DIR}/rc-to-main.sh" "$RELEASE" 2>&1 | tee /dev/stderr)" + if [ -z "$PR_NUMBER" ]; then + PR_NUMBER="$(echo "$PR_OUTPUT" | sed -n 's|.*/pull/\([0-9]*\).*|\1|p' | head -1)" + fi + printf "\n" +fi + +if [ "$POST_PR_RESULTS" = true ]; then + if [ -z "$PR_NUMBER" ]; then + printf "${ERR}--post-pr-results requires --pr-number or a successful --create-pr.${NC}\n" >&2 + exit 1 + fi + printf "=== Step 8: Post QA results on PR #%s ===\n" "$PR_NUMBER" + "${QA_DIR}/update-release-pr-results.sh" \ + "$PR_NUMBER" \ + "$AWS_CHECKLIST" \ + "$GCP_CHECKLIST" \ + "$AWS_SUMMARY" \ + "$GCP_SUMMARY" + printf "\n" +fi + +printf "${SUCCESS}Release QA complete for %s.${NC}\n" "$RELEASE" +printf "\nArtifacts:\n" +printf " AWS tests: %s\n" "$AWS_TEST_LOG" +printf " GCP tests: %s\n" "$GCP_TEST_LOG" +printf " AWS summary: %s\n" "$AWS_SUMMARY" +printf " GCP summary: %s\n" "$GCP_SUMMARY" + +if [ "$CREATE_PR" = false ]; then + printf "\nNext steps:\n" + printf " git checkout %s\n" "$RC_BRANCH" + printf " ./tools/release/rc-to-main.sh %s\n" "$RELEASE" + printf " ./tools/release/qa/update-release-pr-results.sh \\\n" + printf " %s %s %s %s\n" "$AWS_CHECKLIST" "$GCP_CHECKLIST" "$AWS_SUMMARY" "$GCP_SUMMARY" +fi + +if [ "$CREATE_PR" = true ] && [ "$POST_PR_RESULTS" = false ]; then + 
printf "\nPost QA to PR:\n" + printf " ./tools/release/qa/update-release-pr-results.sh %s \\\n" "$PR_NUMBER" + printf " %s %s %s %s\n" "$AWS_CHECKLIST" "$GCP_CHECKLIST" "$AWS_SUMMARY" "$GCP_SUMMARY" +fi + +if [ "$CREATE_PR" = true ] && [ "$POST_PR_RESULTS" = true ]; then + printf "\nAfter merge to main:\n" + printf " ./tools/release/publish.sh %s\n" "$RELEASE" +fi diff --git a/tools/upgrade-terraform-modules.sh b/tools/upgrade-terraform-modules.sh index a829da0e49..0f237c0175 100755 --- a/tools/upgrade-terraform-modules.sh +++ b/tools/upgrade-terraform-modules.sh @@ -105,5 +105,31 @@ if [[ "$NEXT_MINOR" =~ ^[0-9]+$ ]] && [[ "$CURRENT_MINOR" =~ ^[0-9]+$ ]]; then fi fi +UPGRADE_GUIDE_URL="https://github.com/Worklytics/psoxy/blob/main/docs/guides/upgrading-versions.md#reviewing-your-terraform-plan" +PLAN_TIMESTAMP=$(date +%Y%m%d-%H%M%S) +PLAN_FILE="${PLAN_TIMESTAMP}-upgrade-plan.txt" + printf "\n${WARN}NOTE:${NC} No changes have yet been made to your infrastructure.\n" -printf "The updated Terraform configuration must still be applied. 
Run ${CODE}terraform plan${NC} followed by ${CODE}terraform apply${NC} to provision these upgrades.\n" \ No newline at end of file +printf "The updated Terraform configuration must still be applied.\n\n" + +printf "Before applying, preview what will change:\n" +printf " ${CODE}terraform plan${NC}\n\n" +printf "To save the plan for review, redirect output to a dated file:\n" +printf " ${CODE}terraform plan -no-color > \"${PLAN_FILE}\" 2>&1${NC}\n\n" + +printf "${WARN}Review the plan carefully${NC} before running ${CODE}terraform apply${NC}.\n" +printf "Consider sharing it with Worklytics support, teammates, or an LLM.\n" +printf "Full guidance: ${INFO}${UPGRADE_GUIDE_URL}${NC}\n\n" + +printf "Key items to watch for:\n" +printf " - ${ERR}Rotating/destroying the pseudonymization SALT${NC} — previously pseudonymized data will be inconsistent; restore the prior salt or re-ingest all data to Worklytics\n" +printf " - ${ERR}Replacing Lambda/Cloud Function resources${NC} — especially their function URLs (update connections in Worklytics)\n" +printf " - ${ERR}Replacing any -input buckets${NC} — update data pipelines that write to these buckets\n" +printf " - ${ERR}Replacing any -sanitized buckets${NC} — update connections in Worklytics\n" +printf " - ${ERR}Replacing parameters/secrets with API credentials${NC} — when NOT managed by this Terraform configuration (recover credentials or obtain new ones)\n" +printf " - ${ERR}Replacing the IAM role used by Worklytics${NC} — to invoke cloud functions or read from -sanitized buckets (update connections in Worklytics)\n\n" + +printf "Example LLM prompt:\n" +printf " ${CODE}Review and summarize the output of terraform plan stored in ${PLAN_FILE}. 
Flag any high-risk changes, especially destruction or replacement of the pseudonymization SALT, Lambda/Cloud Function resources (and their function URLs), -input buckets, -sanitized buckets, unmanaged API credential parameters/secrets, and the IAM role Worklytics uses to invoke functions or read sanitized buckets. For each issue, explain the operational impact and what I must do before applying.${NC}\n\n" + +printf "When the plan looks safe, run ${CODE}terraform apply${NC} to provision these upgrades.\n" \ No newline at end of file