fix(nginx): add resolver and upstream resolve to prevent stale IP rou… #4295

Open
ilias-115 wants to merge 1 commit into getsentry:master from ilias-115:fix/nginx-upstream-dns-refresh

@ilias-115

Why

In the default self-hosted Docker Compose deployment, nginx proxies traffic to
the relay and web services by hostname.

When those containers are recreated and receive a new container IP, nginx may
continue using a stale upstream address until nginx itself is reloaded or
restarted.

Observed symptom:

  • connect() failed (...) while connecting to upstream
  • nginx attempts to reach a stale upstream container IP

What changed

  • Added a DNS resolver in nginx.conf
  • Enabled runtime DNS re-resolution for upstream backends:
    • relay:3000 resolve
    • web:9000 resolve
  • Added zone directives required by nginx for dynamic upstream resolution

Config details

  • resolver 127.0.0.11 valid=30s;
  • resolver_timeout 5s;
  • zone relay 64k;
  • zone sentry 64k;

Why these values

  • 127.0.0.11: Docker embedded DNS in the default self-hosted container network
  • valid=30s: overrides the DNS record TTL so cached answers expire after 30 seconds, balancing recovery time against DNS lookup overhead
  • resolver_timeout 5s: avoids long stalls on DNS issues
  • 64k zone: sufficient for these small upstream groups
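
Taken together, the directives above would look roughly like the fragment below. This is a sketch rather than the PR's actual diff: the upstream names `relay` and `sentry` are inferred from the zone names, and any other directives in those blocks are omitted. Note that in open-source nginx the `resolve` server parameter is available from 1.27.3; before that it was limited to the commercial build.

```nginx
http {
    # Docker's embedded DNS server; cached answers expire after 30s
    # regardless of the TTL in the response.
    resolver 127.0.0.11 valid=30s;
    resolver_timeout 5s;

    upstream relay {
        # Shared memory zone: required for a server that uses "resolve".
        zone relay 64k;
        # Re-resolve "relay" at runtime instead of only at startup.
        server relay:3000 resolve;
    }

    upstream sentry {
        zone sentry 64k;
        server web:9000 resolve;
    }
}
```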

Test plan

  • nginx starts successfully with the updated config
  • traffic is routed correctly to relay and web
  • after recreating relay, nginx no longer keeps using a stale upstream IP
  • manual local verification of continued ingestion after backend IP change

Notes

This change is intended to make nginx more resilient to backend container IP
changes in the default self-hosted Docker Compose setup.

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@aminvakil
Collaborator

This was discussed in #4079, and the decision at the time was not to add it.

@moroine
Contributor

moroine commented Apr 24, 2026

I see the workaround with the depends_on but somehow it didn't work in our case. I think nginx is not restarted when web or relay is recreated after a crash.

I get the point about the DNS server, which can vary depending on the host. IMO we have two options:

  • retrieve it from /etc/resolv.conf
  • use the old-fashioned way:
    set $upstream "http://relay:3000";
    proxy_pass $upstream;
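
For reference, a fuller sketch of that variable-based form. The `listen` port and `location /` are placeholders rather than the project's actual config. A `resolver` is still required here: per the nginx documentation, a name in a variable `proxy_pass` is first searched among the defined server groups and, if not found, looked up through the resolver.

```nginx
server {
    listen 80;  # placeholder port

    # Needed so the name in $upstream can be looked up at request time.
    resolver 127.0.0.11 valid=30s;

    location / {
        # A variable in proxy_pass defers name resolution to request time,
        # instead of resolving the name once when the config is loaded.
        set $upstream "http://relay:3000";
        proxy_pass $upstream;
    }
}
```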

@ilias-115
Author

ilias-115 commented Apr 24, 2026

> This was discussed in #4079, and the decision at the time was not to add it.

This PR comes from a real production incident on our side, not from a theoretical optimization.

According to Docker docs, depends_on only guarantees dependent restart on explicit Compose operations (e.g. docker compose restart), not all runtime lifecycle events.
For crash/OOM/runtime restarts, behavior is controlled by Docker restart policies (always, unless-stopped, on-failure).

For DNS portability, we can avoid hardcoding by retrieving resolver IPs from /etc/resolv.conf.

References:
https://docs.docker.com/compose/how-tos/startup-order/
-> "restart: true ensures that if db is updated or restarted due to an explicit Compose operation, for example docker compose restart"
docker/compose#12477 (comment)
-> systemctl restart docker is unrelated to docker compose.

@aminvakil
Collaborator

Sorry, I mentioned the wrong pull request.
Here is the issue:
#3894

> I see the workaround with the depends_on but somehow it didn't work in our case. I think nginx is not restarted when web or relay is recreated after a crash.
>
> I get the point about the DNS server, which can vary depending on the host. IMO we have two options:
>
> * retrieve it from `/etc/resolv.conf`
> * use the old-fashioned way:

Using DNS servers from /etc/resolv.conf does not work in Compose: the upstream DNS servers listed in /etc/resolv.conf do not know what web and relay are.

> set $upstream "http://relay:3000";
> proxy_pass $upstream;

This is simply bad practice. You should use resolver when you need this behaviour; do not push random configurations just to get your PR through.

> > This was discussed in #4079, and the decision at the time was not to add it.
>
> This PR comes from a real production incident on our side, not from a theoretical optimization.
>
> According to Docker docs, depends_on only guarantees dependent restart on explicit Compose operations (e.g. docker compose restart), not all runtime lifecycle events. For crash/OOM/runtime restarts, behavior is controlled by Docker restart policies (always, unless-stopped, on-failure).

Agreed, but as I stated in #3894, the real fix would be to understand why relay crashed, not to push latency onto all self-hosted users.

> For DNS portability, we can avoid hardcoding by retrieving resolver IPs from /etc/resolv.conf.

Answered above: it does not work.

> References: https://docs.docker.com/compose/how-tos/startup-order/ -> "restart: true ensures that if db is updated or restarted due to an explicit Compose operation, for example docker compose restart" docker/compose#12477 (comment) -> systemctl restart docker is unrelated to docker compose.

You're correct. The current startup-order handling does not cover all the problems that may arise in self-hosted, but I still do not think we should use resolver in every self-hosted installation.

But I'm not a maintainer of this project and I am only stating my opinion here; my word is not final :)

@aldy505 What do you think?

@ilias-115
Author

Thanks for the detailed feedback — this makes sense.

Just to clarify one point: when I mentioned retrieving DNS from /etc/resolv.conf, I meant the file inside the nginx container, not on the host. In Docker user-defined networks this is typically Docker’s embedded DNS, which resolves web and relay service names.
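
One possible way to do that without hardcoding an address, assuming the stock nginx Docker image is used: to my understanding, its entrypoint can read the nameservers from the container's /etc/resolv.conf and export them as `NGINX_LOCAL_RESOLVERS` when `NGINX_ENTRYPOINT_LOCAL_RESOLVERS` is set, and files under /etc/nginx/templates are passed through envsubst at startup. A sketch of such a template follows; the file name is hypothetical, and the self-hosted nginx.conf would have to include the rendered file.

```nginx
# /etc/nginx/templates/resolver.conf.template  (hypothetical file name)
# Assumed to be rendered by the image's entrypoint into /etc/nginx/conf.d/.
# Assumes NGINX_ENTRYPOINT_LOCAL_RESOLVERS=1 is set on the nginx service.
resolver ${NGINX_LOCAL_RESOLVERS} valid=30s;
resolver_timeout 5s;
```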

Also, we hit this not only after crashes, but during normal operations too (VM reboot / Docker daemon restart). So this is not only a “fix crash root cause” topic — it is also an upstream DNS resilience issue for expected runtime events where container IPs can change.

I agree root-cause work is still needed, but it does not remove the need for robust upstream re-resolution in nginx.

If changing default behavior is out of scope, I’m happy to re-scope this as:

  • an opt-in behavior, or
  • a documentation-backed workaround for users affected by stale upstream IPs.
