[FOR TESTING CI/CD] chore: emptiness is calming #1590

Closed
vladfrangu wants to merge 1 commit into master from chore/empty-test-branch-for-tests

Conversation

@vladfrangu

Member

Just a PR to test out docs preview changes

@github-actions github-actions Bot added this to the 115th sprint - Tooling team milestone May 19, 2025
@github-actions github-actions Bot added the t-tooling Issues with this label are in the ownership of the tooling team. label May 19, 2025
@apify-service-account

Contributor

Preview for this PR was built for commit 141177e and is ready at https://pr-1590.preview.docs.apify.com!

@vladfrangu vladfrangu changed the title chore: emptiness is calming [FOR TESTING CI/CD] chore: emptiness is calming Jun 9, 2025
@B4nan B4nan closed this Jul 22, 2025
@B4nan B4nan deleted the chore/empty-test-branch-for-tests branch November 25, 2025 10:50
@webrdaniel

Contributor

@vladfrangu the preview of this PR is still live and indexed. New previews are correctly marked noindex, but this one is older, so it doesn't have the tag. Could you please try to take this preview down?

@vladfrangu

Member Author

oh damn, how is this still alive 🙃, yep, will nuke

@vladfrangu

Member Author

Deleted, let me know if there's any other preview still alive

@marekh19

Contributor

Just checked https://www.google.com/search?q=site:preview.docs.apify.com

There are more pages from preview environments that are still live and indexed.

I investigated for a bit with Claude. Summary of the findings, @vladfrangu:


Spot-checked PRs surfacing in https://www.google.com/search?q=site:preview.docs.apify.com.
The picture is worse than "stale S3 prefixes". Three separate problems are at play:

  1. Wildcard nginx routing makes SDK/Client/CLI paths permanently reachable on every
    preview subdomain.
    In nginx.conf for *.preview.docs.apify.com, the location
    blocks for /api/client/js/, /api/client/python/, /sdk/js/, /sdk/python/,
    and /cli/ proxy directly to apify.github.io/* and ignore the subdomain entirely.
    pr-1599.preview.docs.apify.com/sdk/js/reference/... and
    pr-1897.preview.docs.apify.com/api/client/js/reference/... both return 200 even
    though those PRs are merged and their S3 prefixes are empty — so does
    pr-99999999.preview.docs.apify.com/sdk/js/reference/... for a PR that never
    existed. The response has no X-Robots-Tag header and the served HTML has no
    <meta name="robots"> tag (the upstream apify.github.io/* pages also have no
    such tag), so Google indexes them. This appears to be the main source of indexed
    URLs in the sample we checked.

  2. NO_INDEX: true on the Docusaurus build was only added on 2025-06-18
    (apify-docs-private commit e243238b, "feat: Don't index preview docs PRs",
    one-line diff adding NO_INDEX: true to the deploy env). Previews whose last
    build ran before that date had no noindex meta tag in the Docusaurus HTML and
    were indexable by design. And even after that date, the flag only affects the
    Docusaurus-built HTML in S3 — it does nothing for the upstream-proxied
    SDK/Client/CLI paths in (1).

  3. The teardown pipeline can fail silently at three layers, with no alerting:

    • The Docs PR Previews workflow in apify-docs did not run on the closed
      event for PR #1615 ("chore: bump Docusaurus + React, use yarn", closed 2025-07-16): no workflow run exists for branch
      bump/react-docusaurus on that date. Root cause unconfirmed (could be a
      missed event delivery, a transient Actions issue, or something else), but the
      net effect is no teardown was dispatched. The S3 prefix pr-1615/ is still
      live today (5+ keys), and pr-1615.preview.docs.apify.com/ returns 200 with
      no <meta robots> tag — that PR's last preview build ran on 2025-06-05,
      before the NO_INDEX commit. This is the only PR in the sample where the
      S3 deploy itself was never torn down.
    • gh workflow run from the public to the private repo only confirms the
      dispatch was accepted, not that the downstream workflow executed or succeeded.
    • aws s3 rm --recursive exits 0 even when no objects matched the prefix, so
      the teardown job is marked ✅ even if there was nothing to delete (or if the
      prefix was already empty for an unrelated reason). The follow-up "preview
      deleted" PR comment uses if: always(), so it posts regardless of outcome.
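
To make the last point concrete, here is a minimal sketch of a teardown step that verifies its own outcome instead of trusting an exit code. Everything in it is hypothetical: `teardown_preview`, `TeardownResult`, and the injected `list_keys` / `delete_keys` callables are illustration names, not anything from the actual apify-docs workflows, and the storage layer is deliberately abstracted away so the logic can be exercised without AWS.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TeardownResult:
    """Outcome of one teardown attempt (hypothetical type, for illustration)."""

    prefix: str
    deleted: int
    remaining: int

    @property
    def ok(self) -> bool:
        # Success means something was actually removed AND nothing is left.
        # An already-empty prefix is reported as not ok so that a human (or
        # an alert) looks at it instead of getting a silent green check.
        return self.deleted > 0 and self.remaining == 0


def teardown_preview(
    prefix: str,
    list_keys: Callable[[str], List[str]],
    delete_keys: Callable[[Iterable[str]], None],
) -> TeardownResult:
    """Delete every key under `prefix`, then re-list to confirm it is empty."""
    before = list_keys(prefix)
    if before:
        delete_keys(before)
    after = list_keys(prefix)
    return TeardownResult(
        prefix=prefix,
        deleted=len(before) - len(after),
        remaining=len(after),
    )
```

A caller would fail the job (and skip the "preview deleted" comment) whenever `result.ok` is false. Whether an already-empty prefix should count as a failure or only as a warning is a judgment call; the sketch picks the stricter option.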

The cleanest single fix that covers all three is at the edge: add the directive
add_header X-Robots-Tag "noindex, nofollow" always; to the *.preview.docs.apify.com
server block in nginx.conf. It is build-independent, covers both the wildcard-proxied
paths and the S3-served Docusaurus paths, and masks future teardown failures.
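
For verifying a fix like this from the outside, a small check along the following lines could help. It is only a sketch: `is_noindex` is a hypothetical helper, it takes already-fetched response headers and HTML rather than doing any network I/O, and it treats a page as protected if either the `X-Robots-Tag` header or a `<meta name="robots">` tag contains `noindex`.

```python
from html.parser import HTMLParser
from typing import List, Mapping


class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def is_noindex(headers: Mapping[str, str], html: str) -> bool:
    """Return True if the response tells crawlers not to index the page."""
    # Header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in directive for directive in parser.directives)
```

Checking both signals matters here because the header would come from nginx while the meta tag comes from the Docusaurus build, and the findings above show the two can disagree.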

PRs examined

@vladfrangu

Member Author

Yep, I got a report from SEO team too about this, and will work on the two fixes that should prevent Google from indexing future previews

@vladfrangu

Member Author

It also looks like only 1 PR is actually still wrongly live, the rest 404 so we just need to tell Google to delete them?

@marekh19

Contributor

Not fully. The base URL returns 404, but there are some orphan pages from different PRs. If you check the search result I shared, there are records still live on other preview deployments (usually only one or a couple of pages).

e.g. https://pr-1599.preview.docs.apify.com/sdk/js/reference/3.5/interface/DatasetReducer

curl -I https://pr-1599.preview.docs.apify.com/sdk/js/reference/3.5/interface/DatasetReducer

HTTP/2 200
date: Mon, 25 May 2026 10:42:59 GMT
content-type: text/html; charset=utf-8
content-length: 36968
server: nginx
x-origin-cache: HIT
last-modified: Fri, 22 May 2026 10:24:15 GMT
access-control-allow-origin: *
strict-transport-security: max-age=31556952
etag: "6a102ecf-9068"
expires: Mon, 25 May 2026 10:52:58 GMT
cache-control: max-age=600
x-proxy-cache: MISS
x-github-request-id: EE90:187756:1031CCB:123B634:6A1427B2
accept-ranges: bytes
age: 0
via: 1.1 varnish
x-served-by: cache-iad-kjyo7100098-IAD
x-cache: MISS
x-cache-hits: 0
x-timer: S1779705779.942886,VS0,VE546
vary: Accept-Encoding
x-fastly-request-id: f25deaf293c2f69fb4ec9b7bfc4d96dd66a3e217
x-frame-options: SAMEORIGIN
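
To triage a list of indexed URLs like the one above, something like this sketch could separate the proxied SDK/Client/CLI paths from pages that a preview deployment served itself. The path prefixes are the ones listed in the findings summary earlier in this thread; the host pattern `pr-<number>.preview.docs.apify.com` is inferred from the URLs quoted here, and `classify_preview_url` is a hypothetical name.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Paths that, per the findings in this thread, nginx proxies upstream
# regardless of which preview subdomain was requested.
PROXIED_PREFIXES = (
    "/api/client/js/",
    "/api/client/python/",
    "/sdk/js/",
    "/sdk/python/",
    "/cli/",
)

# Assumed host shape, based on the preview URLs quoted in the thread.
_HOST_RE = re.compile(r"^pr-(\d+)\.preview\.docs\.apify\.com$")


def classify_preview_url(url: str) -> Optional[Tuple[int, bool]]:
    """Return (pr_number, is_proxied_path), or None for non-preview hosts."""
    parts = urlsplit(url)
    match = _HOST_RE.match((parts.hostname or "").lower())
    if match is None:
        return None
    return int(match.group(1)), parts.path.startswith(PROXIED_PREFIXES)
```

For the URL in the `curl -I` example, the function returns `(1599, True)`: the page is reachable because of the wildcard proxy, not because a `pr-1599/` deployment still exists.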

@vladfrangu

Member Author

Can you poke me on Slack about this, please? I'd also love to see if we can get these removed from Google automatically. I'll see if I can list all PR preview buckets and nuke inactive ones
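
A rough sketch of the "list and nuke inactive ones" step, assuming previews live under S3 prefixes of the form `pr-<number>/` (as with `pr-1615/` above) and that "inactive" simply means the PR is no longer open. Both assumptions, and the name `stale_preview_prefixes`, are hypothetical; the function only computes what to delete and leaves listing and deletion to the caller.

```python
import re
from typing import Iterable, List, Set

# Assumed prefix shape: "pr-<number>/". Anything else is left untouched.
_PREFIX_RE = re.compile(r"^pr-(\d+)/$")


def stale_preview_prefixes(prefixes: Iterable[str], open_prs: Set[int]) -> List[str]:
    """Return preview prefixes whose PR number is not in `open_prs`."""
    stale = []
    for prefix in prefixes:
        match = _PREFIX_RE.match(prefix)
        if match is None:
            continue  # Not a PR preview prefix; never propose deleting it.
        number = int(match.group(1))
        if number not in open_prs:
            stale.append((number, prefix))
    # Sort numerically so the result is stable and easy to review.
    return [prefix for _, prefix in sorted(stale)]
```

Keeping the selection separate from the deletion makes it easy to print the candidate list for review before anything is removed.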
