From 6bfaf2525921b41df5af8450a00d7295d16fb64c Mon Sep 17 00:00:00 2001
From: Fortune-Ndlovu
Date: Mon, 29 Jun 2026 11:50:29 +0100
Subject: [PATCH 1/4] docs: fix lock file deadlock docs for dynamic plugins cache

- Fix wrong lock file name in rm command (RHDHBUGS-450)
- Document 10-min timeout and CrashLoopBackOff behavior
- Add DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS env var
- Clarify lock only applies with persistent volumes
---
 docs/dynamic-plugins/installing-plugins.md | 27 +++++++++++++++++-----
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/docs/dynamic-plugins/installing-plugins.md b/docs/dynamic-plugins/installing-plugins.md
index bbaea0d48b..78d0d1c200 100644
--- a/docs/dynamic-plugins/installing-plugins.md
+++ b/docs/dynamic-plugins/installing-plugins.md
@@ -287,23 +287,38 @@ When using the Operator ....
 
 The directory where dynamic plugins are located is mounted as a volume to the `install-dynamic-plugins` init container and the `backstage-backend` container. The `install-dynamic-plugins` init container is responsible for downloading and extracting the plugins into this directory. Depending on the deployment method, the directory is mounted as an ephemeral or persistent volume. In the latter case, the volume can be shared between several Pods, and the plugins installation script is also responsible for downloading and extracting the plugins only once, avoiding conflicts.
 
-**Important Note:** If `install-dynamic-plugins` init container was killed with SIGKILL signal, which may happen due to the following reasons:
+**Important Note:** When the `dynamic-plugins-root` directory is backed by a persistent volume, the `install-dynamic-plugins` init container uses a lock file (`/dynamic-plugins-root/install-dynamic-plugins.lock`) to prevent concurrent plugin installations across Pods that share the same volume. The lock is acquired before installation begins and released when it completes (or fails).
+
+If the `install-dynamic-plugins` init container is killed with a SIGKILL signal, the lock file cannot be cleaned up. This may happen due to the following reasons:
 
 - pod eviction (to free up node resources)
-- pod deletion (if not terminated with SIGTERM within graceful period)
+- pod deletion (if not terminated with SIGTERM within the graceful period)
 - node shutdown
 - container runtime issues
 - exceeding resource limits (OOM for example)
 
-Then the script will not be able to remove the lock file, so the next time the pod starts, it will be be stuck waiting for the lock to release. You will see the following message in the logs for the init `install-dynamic-plugins` container:
+When this occurs, the next pod to start will wait up to **10 minutes** (by default) for the stale lock to be released, logging the following message every second:
 
 ```console
 oc logs -n -f backstage-- -c install-dynamic-plugins
-======= Waiting for lock release (file: /dynamic-plugins-root/install-dynamic-plugins.lock)...
+======= Waiting for lock to be released: /dynamic-plugins-root/install-dynamic-plugins.lock
+```
+
+After the timeout expires, the init container exits with an error:
+
 ```
+Timed out after 600000ms waiting for lock file /dynamic-plugins-root/install-dynamic-plugins.lock.
+Another install may be stuck — remove the file manually to proceed.
+```
+
+The pod then enters a CrashLoopBackOff cycle, restarting and waiting again every 10 minutes, until the stale lock file is manually removed.
 
-In such a case, you can delete the lock file manually from any of the Pods:
+To resolve this, delete the lock file from any of the Pods:
 
 ```console
-oc exec -n deploy/backstage- -c install-dynamic-plugins -- rm -f /dynamic-plugins-root/dynamic-plugins.lock
+oc exec -n deploy/backstage- -c install-dynamic-plugins -- rm -f /dynamic-plugins-root/install-dynamic-plugins.lock
 ```
+
+The lock timeout can be configured via the `DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS` environment variable on the `install-dynamic-plugins` init container (value in milliseconds, default: `600000` which is 10 minutes).
+
+Note: This lock file behavior only applies when using a persistent volume for the `dynamic-plugins-root` directory. With the default ephemeral volume, each pod gets its own volume, so no lock contention can occur.

From 737bc98758fbcf3236cb090f641053d2dc3d75b1 Mon Sep 17 00:00:00 2001
From: Fortune-Ndlovu
Date: Wed, 1 Jul 2026 17:48:47 +0100
Subject: [PATCH 2/4] docs: fix lock file auto-recovery behavior and add 1.10.x note

Remove incorrect CrashLoopBackOff claim. The exit handler auto-cleans
the stale lock on shutdown so the pod self-heals after one timeout
cycle. Reframe manual rm as optional shortcut. Add note that 1.10.x
and earlier had no timeout (Python script, infinite wait).

Signed-off-by: Fortune Ndlovu
---
 docs/dynamic-plugins/installing-plugins.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/dynamic-plugins/installing-plugins.md b/docs/dynamic-plugins/installing-plugins.md
index 78d0d1c200..19208061a0 100644
--- a/docs/dynamic-plugins/installing-plugins.md
+++ b/docs/dynamic-plugins/installing-plugins.md
@@ -297,7 +297,7 @@ If the `install-dynamic-plugins` init container is killed with a SIGKILL signal,
 - container runtime issues
 - exceeding resource limits (OOM for example)
 
-When this occurs, the next pod to start will wait up to **10 minutes** (by default) for the stale lock to be released, logging the following message every second:
+When this occurs, the next pod to start will wait up to **10 minutes** (by default) for the stale lock to be released, logging the following message:
 
 ```console
 oc logs -n -f backstage-- -c install-dynamic-plugins
@@ -311,9 +311,9 @@ Timed out after 600000ms waiting for lock file /dynamic-plugins-root/install-dyn
 Another install may be stuck — remove the file manually to proceed.
 ```
 
-The pod then enters a CrashLoopBackOff cycle, restarting and waiting again every 10 minutes, until the stale lock file is manually removed.
+The exit handler automatically removes the stale lock file during shutdown. The pod restarts, and the next init container run starts with no lock file present, so it proceeds normally. The total recovery time equals the configured lock timeout (10 minutes by default). No manual intervention is required.
 
-To resolve this, delete the lock file from any of the Pods:
+To skip the timeout wait and recover immediately, delete the lock file manually:
 
 ```console
 oc exec -n deploy/backstage- -c install-dynamic-plugins -- rm -f /dynamic-plugins-root/install-dynamic-plugins.lock
@@ -321,4 +321,6 @@ oc exec -n deploy/backstage- -c install-dynamic
 
 The lock timeout can be configured via the `DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS` environment variable on the `install-dynamic-plugins` init container (value in milliseconds, default: `600000` which is 10 minutes).
 
+> **Note:** In RHDH 1.10.x and earlier, the install script used a Python implementation with no lock timeout. A stale lock file would cause the init container to wait indefinitely, and the only way to recover was to manually delete the lock file. The timeout, configurable environment variable, and automatic lock cleanup on exit were introduced with the TypeScript rewrite of the install script.
+
 Note: This lock file behavior only applies when using a persistent volume for the `dynamic-plugins-root` directory. With the default ephemeral volume, each pod gets its own volume, so no lock contention can occur.

From 533837a499b4282acd1d2f9c7fbb36e53b62e31c Mon Sep 17 00:00:00 2001
From: Fortune-Ndlovu
Date: Sat, 4 Jul 2026 14:38:36 +0100
Subject: [PATCH 3/4] docs: clarify lock file is always created, contention requires persistent volume

Address review feedback: the lock file is created unconditionally by
the init container, not only when a persistent volume is used. Lock
contention is what requires a shared persistent volume. Also fix Note
formatting consistency.

Signed-off-by: Fortune Ndlovu
---
 docs/dynamic-plugins/installing-plugins.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/dynamic-plugins/installing-plugins.md b/docs/dynamic-plugins/installing-plugins.md
index 19208061a0..9afdf25728 100644
--- a/docs/dynamic-plugins/installing-plugins.md
+++ b/docs/dynamic-plugins/installing-plugins.md
@@ -287,7 +287,7 @@ When using the Operator ....
 
 The directory where dynamic plugins are located is mounted as a volume to the `install-dynamic-plugins` init container and the `backstage-backend` container. The `install-dynamic-plugins` init container is responsible for downloading and extracting the plugins into this directory. Depending on the deployment method, the directory is mounted as an ephemeral or persistent volume. In the latter case, the volume can be shared between several Pods, and the plugins installation script is also responsible for downloading and extracting the plugins only once, avoiding conflicts.
 
-**Important Note:** When the `dynamic-plugins-root` directory is backed by a persistent volume, the `install-dynamic-plugins` init container uses a lock file (`/dynamic-plugins-root/install-dynamic-plugins.lock`) to prevent concurrent plugin installations across Pods that share the same volume. The lock is acquired before installation begins and released when it completes (or fails).
+**Important Note:** The `install-dynamic-plugins` init container always acquires a lock file (`/dynamic-plugins-root/install-dynamic-plugins.lock`) before installing plugins. The lock prevents concurrent installations and is released when the process completes (or fails). Lock contention — where one pod waits for another's lock — only occurs when the `dynamic-plugins-root` directory is backed by a persistent volume shared between pods.
 
 If the `install-dynamic-plugins` init container is killed with a SIGKILL signal, the lock file cannot be cleaned up. This may happen due to the following reasons:
 
@@ -323,4 +323,4 @@ The lock timeout can be configured via the `DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS` env
 
 > **Note:** In RHDH 1.10.x and earlier, the install script used a Python implementation with no lock timeout. A stale lock file would cause the init container to wait indefinitely, and the only way to recover was to manually delete the lock file. The timeout, configurable environment variable, and automatic lock cleanup on exit were introduced with the TypeScript rewrite of the install script.
 
-Note: This lock file behavior only applies when using a persistent volume for the `dynamic-plugins-root` directory. With the default ephemeral volume, each pod gets its own volume, so no lock contention can occur.
+> **Note:** Lock contention only applies when using a persistent volume for the `dynamic-plugins-root` directory. With the default ephemeral volume, each pod gets its own volume, so no lock contention can occur.

From 61342a4cae01badd555e2a9acca2c92a104372b1 Mon Sep 17 00:00:00 2001
From: Fortune-Ndlovu
Date: Sat, 4 Jul 2026 14:43:09 +0100
Subject: [PATCH 4/4] docs: remove redundant notes to tighten PR scope

Drop the RHDH 1.10.x historical note and the persistent volume note
(already covered in the intro paragraph).

Signed-off-by: Fortune Ndlovu
---
 docs/dynamic-plugins/installing-plugins.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/dynamic-plugins/installing-plugins.md b/docs/dynamic-plugins/installing-plugins.md
index 9afdf25728..b8a1c0c975 100644
--- a/docs/dynamic-plugins/installing-plugins.md
+++ b/docs/dynamic-plugins/installing-plugins.md
@@ -320,7 +320,3 @@ oc exec -n deploy/backstage- -c install-dynamic
 ```
 
 The lock timeout can be configured via the `DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS` environment variable on the `install-dynamic-plugins` init container (value in milliseconds, default: `600000` which is 10 minutes).
-
-> **Note:** In RHDH 1.10.x and earlier, the install script used a Python implementation with no lock timeout. A stale lock file would cause the init container to wait indefinitely, and the only way to recover was to manually delete the lock file. The timeout, configurable environment variable, and automatic lock cleanup on exit were introduced with the TypeScript rewrite of the install script.
-
-> **Note:** Lock contention only applies when using a persistent volume for the `dynamic-plugins-root` directory. With the default ephemeral volume, each pod gets its own volume, so no lock contention can occur.
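
Illustration of the documented lock behavior (not part of the patch): the series describes a lock file that is acquired before installation, a configurable wait governed by `DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS` (default `600000`), a timeout error, and removal of the lock on exit. The Python sketch below models that flow under stated assumptions. It is not the install script, which the series describes as a TypeScript rewrite; the function names, the exclusive-create strategy, and the one-second polling interval are hypothetical, and only the environment variable name, the default value, and the message wording come from the patch text.

```python
import os
import time

# Default stated in the docs: 600000 ms (10 minutes).
DEFAULT_LOCK_TIMEOUT_MS = 600_000


def lock_timeout_ms() -> int:
    """Read DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS, falling back to the documented default."""
    return int(os.environ.get("DYNAMIC_PLUGINS_LOCK_TIMEOUT_MS", DEFAULT_LOCK_TIMEOUT_MS))


def acquire_lock(lock_path: str, timeout_ms: int, poll_interval_s: float = 1.0) -> None:
    """Create the lock file exclusively, waiting up to timeout_ms if it already exists.

    Exclusive creation (O_CREAT | O_EXCL) is an assumption about how a lock
    like this could be taken; the real script may do it differently.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"Timed out after {timeout_ms}ms waiting for lock file {lock_path}."
                )
            print(f"======= Waiting for lock to be released: {lock_path}")
            time.sleep(poll_interval_s)


def release_lock(lock_path: str) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


def install_with_lock(lock_path, install, timeout_ms=None, poll_interval_s=1.0):
    """Run install() under the lock and always remove the lock file on the way out.

    The unconditional cleanup mirrors the series' statement that the exit
    handler removes a stale lock even after a timeout (hypothetical modelling).
    """
    if timeout_ms is None:
        timeout_ms = lock_timeout_ms()
    try:
        acquire_lock(lock_path, timeout_ms, poll_interval_s)
        install()
    finally:
        release_lock(lock_path)
```

In this model a stale lock costs one full timeout: the first run fails with `TimeoutError` and removes the file, and the next run proceeds immediately, which matches the self-healing behavior described in patch 2/4. A SIGKILL bypasses the `finally` clause, which is how the stale lock arises in the first place.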