fix(nginx): add resolver and upstream resolve to prevent stale IP routing by ilias-115 · Pull Request #4295 · getsentry/self-hosted

ilias-115 · 2026-04-21T21:38:20Z

Why

In the default self-hosted Docker Compose deployment, nginx proxies traffic to
the relay and web services by hostname.

When those containers are recreated and receive a new container IP, nginx may
continue using a stale upstream address until nginx itself is reloaded or
restarted.

Observed symptom:

- connect() failed (...) while connecting to upstream
- nginx attempts to reach a stale upstream container IP

What changed

- Added a DNS resolver in nginx.conf
- Enabled runtime DNS re-resolution for upstream backends:
  - relay:3000 resolve
  - web:9000 resolve
- Added zone directives required by nginx for dynamic upstream resolution

Config details

resolver 127.0.0.11 valid=30s;
resolver_timeout 5s;
zone relay 64k;
zone sentry 64k;
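For context, the directives above might sit in nginx.conf roughly as follows. This is a minimal sketch, not the PR's actual diff: the upstream group names (relay, sentry), the location paths, and the listen port are assumptions for illustration. As far as we know, the resolve parameter on a server inside an upstream block is available in open-source nginx from 1.27.3 (before that it was NGINX Plus only), and it needs a shared memory zone in the same upstream block.

```nginx
events {}

http {
    # Docker's embedded DNS; only reachable from containers on a
    # user-defined network (assumed: the default Compose network).
    resolver 127.0.0.11 valid=30s;
    resolver_timeout 5s;

    upstream relay {
        # Shared memory zone; required for the "resolve" parameter.
        zone relay 64k;
        # Re-resolve "relay" at runtime instead of only at startup.
        server relay:3000 resolve;
    }

    upstream sentry {
        zone sentry 64k;
        server web:9000 resolve;
    }

    server {
        # Hypothetical port and routes, simplified for illustration.
        listen 80;

        location /api/store/ {
            proxy_pass http://relay;
        }

        location / {
            proxy_pass http://sentry;
        }
    }
}
```

Running nginx -t inside the container is one way to confirm that the installed nginx version accepts the resolve parameter before rolling the change out.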

Why these values

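127.0.0.11: Docker's embedded DNS server, reachable from containers on user-defined networks such as the default self-hosted Compose network
valid=30s: caches DNS answers for 30 seconds, balancing recovery time against DNS lookup overhead
resolver_timeout 5s: avoids long stalls when DNS is slow or unavailable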
64k zone: sufficient for these small upstream groups

Test plan

nginx starts successfully with the updated config
traffic is routed correctly to relay and web
after recreating relay, nginx no longer keeps using a stale upstream IP
manual local verification of continued ingestion after backend IP change

Notes

This change is intended to make nginx more resilient to backend container IP
changes in the default self-hosted Docker Compose setup.

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


aminvakil · 2026-04-23T20:22:45Z

This has been discussed in #4079, and it was decided then not to add this.

moroine · 2026-04-24T06:58:54Z

I see the workaround with depends_on, but somehow it didn't work in our case. I think nginx is not restarted when web or relay is recreated after a crash.

I get the point that the DNS server can vary depending on the host. IMO we have two options:

- retrieve it from /etc/resolv.conf
- use the old-fashioned way:

set $upstream "http://relay:3000";
proxy_pass $upstream;
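A rough sketch of the variable-based pattern, assuming Docker's embedded DNS at 127.0.0.11: when proxy_pass contains a variable, nginx resolves the hostname at request time, which only works if a resolver directive is also configured. The variable name $relay_upstream and the surrounding server block are hypothetical.

```nginx
events {}

http {
    # Still required: nginx uses this resolver for hostnames held in variables.
    resolver 127.0.0.11 valid=30s;

    server {
        listen 80;

        location / {
            # Because the target is a variable, "relay" is looked up at
            # request time (cached for up to 30s) rather than once at startup.
            set $relay_upstream "http://relay:3000";
            # No URI part in the variable, so the original request URI is passed.
            proxy_pass $relay_upstream;
        }
    }
}
```

One trade-off: in this form requests do not go through an upstream{} block, so settings defined there (such as keepalive) would not apply.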

ilias-115 · 2026-04-24T08:40:53Z

> This has been discussed in #4079 and has been decided not to add this then.

This PR comes from a real production incident on our side, not from a theoretical optimization.

According to the Docker docs, depends_on (with restart: true) only guarantees that dependent services are restarted on explicit Compose operations (e.g. docker compose restart), not on all runtime lifecycle events.
For crash/OOM/runtime restarts, behavior is controlled by Docker restart policies (always, unless-stopped, on-failure).

For DNS portability, we can avoid hardcoding by retrieving resolver IPs from /etc/resolv.conf.

References:
https://docs.docker.com/compose/how-tos/startup-order/
-> "restart: true ensures that if db is updated or restarted due to an explicit Compose operation, for example docker compose restart"
docker/compose#12477 (comment)
-> systemctl restart docker is unrelated to docker compose.

aminvakil · 2026-04-25T10:09:39Z

Sorry, I mentioned the wrong pull request. Here is the issue: #3894

> I see the workaround with the depends_on but somehow it didn't worked on our case. I think it's not restarted when web or relay are recreated after a crash.
>
> I get the point with the DNS server which can vary depending on host, IMO we have 2 ways:
> * retrieve from `/etc/resolv.conf`
>
> * use the old fashion way:
Using DNS servers from /etc/resolv.conf does not work in Compose: your upstream DNS server in /etc/resolv.conf does not know what web and relay are.

> set $upstream "http://relay:3000";
> proxy_pass $upstream;

That is simply bad practice. You should use resolver when you need this behaviour; do not push random configurations just to get your PR through.

> This has been discussed in #4079 and has been decided not to add this then.
>
> This PR comes from a real production incident on our side, not from a theoretical optimization.
>
> According to Docker docs, depends_on only guarantees dependent restart on explicit Compose operations (e.g. docker compose restart), not all runtime lifecycle events. For crash/OOM/runtime restarts, behavior is controlled by Docker restart policies (always, unless-stopped, on-failure).

Agreed, but as I stated in #3894, the real fix would be to understand why relay crashed, not to push extra latency onto every self-hosted user.

> For DNS portability, we can avoid hardcoding by retrieving resolver IPs from /etc/resolv.conf.

Answered above, it does not work.

> References:
> https://docs.docker.com/compose/how-tos/startup-order/
> -> "restart: true ensures that if db is updated or restarted due to an explicit Compose operation, for example docker compose restart"
> docker/compose#12477 (comment)
> -> systemctl restart docker is unrelated to docker compose.

You're correct. The current startup order does not handle every problem that may arise in self-hosted, but I still do not think we should enable a resolver in every self-hosted installation.

But I'm not a maintainer of this project and I only state my opinions here, my word is not final :)

@aldy505 What do you think?

ilias-115 · 2026-04-26T20:54:44Z


Thanks for the detailed feedback — this makes sense.

Just to clarify one point: when I mentioned retrieving DNS from /etc/resolv.conf, I meant the file inside the nginx container, not on the host. In Docker user-defined networks this is typically Docker’s embedded DNS, which resolves web and relay service names.
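If the resolver address were taken from the nginx container's own /etc/resolv.conf, one possible mechanism (an assumption, not something proposed in this PR) is the official nginx Docker image's entrypoint: when NGINX_ENTRYPOINT_LOCAL_RESOLVERS is set, it exports NGINX_LOCAL_RESOLVERS built from the container's /etc/resolv.conf, and files matching /etc/nginx/templates/*.template are rendered with envsubst into /etc/nginx/conf.d/. The template file name below is hypothetical, and the rendered file only takes effect if the mounted nginx.conf includes /etc/nginx/conf.d/*.conf inside its http block.

```nginx
# Hypothetical file: /etc/nginx/templates/resolver.conf.template
# Rendered at container start to /etc/nginx/conf.d/resolver.conf, with
# ${NGINX_LOCAL_RESOLVERS} replaced by the nameserver(s) listed in the
# container's /etc/resolv.conf (Docker's embedded DNS on a user-defined network).
resolver ${NGINX_LOCAL_RESOLVERS} valid=30s;
resolver_timeout 5s;
```

This keeps the resolver address out of the committed config, which is the portability concern raised earlier in the thread.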

Also, we hit this not only after crashes but also during normal operations (VM reboot / Docker daemon restart). So this is not only a "fix the crash root cause" topic; it is also an upstream DNS resilience issue for expected runtime events in which container IPs can change.

I agree root-cause work is still needed, but it does not remove the need for robust upstream re-resolution in nginx.

If changing the default behavior is out of scope, I'm happy to re-scope this as:

- an opt-in behavior, or
- a documented workaround for users affected by stale upstream IPs.

Commit 098a3f4: fix(nginx): add resolver and upstream resolve to prevent stale IP routing

github-project-automation Bot added this to Self-hosted Sentry Apr 21, 2026

ilias-115 wants to merge 1 commit into getsentry:master from ilias-115:fix/nginx-upstream-dns-refresh

3 participants
