fix: improve to suitable with onboarding flow by datphamcode295 · Pull Request #2 · lamtuanvu/ldk-node

datphamcode295 · 2026-04-29T08:44:25Z

LDK-Node Fork Modifications

Context

The vendored ./ldk-node checkout (pinned to lamtuanvu/ldk-node@000a3309, branch custom-gossip-data) carries 4 in-tree modifications on top of the pinned commit. They are produced deterministically by infra/scripts/setup-ldk-node.sh, which after cloning runs:

for patch in patches/ldk-node-*.patch; do
  git apply "$patch"
done

Today there is exactly one such file: patches/ldk-node-tx-broadcaster-queue-bump.patch (357 lines), and it contains all 4 modifications — the filename is misleading because it only mentions the queue bump.

So: the working-tree diff is already canonicalized as a patch. Reverse-applying it succeeds (git apply --reverse --check → 0). The question is per modification, can it be dropped without breaking production code paths in this repo?

Verdict matrix:

#	File	Status	Drop = ?
1	`src/lib.rs` `sync_wallets` rewrite	MUST KEEP	Onboarding hangs forever
2	`src/payment/bolt11.rs` `*_with_hints`	MUST KEEP	Inbound payments unroutable
3	`src/config.rs` 10% → 100%	MUST KEEP	Sweep HTLC fails on first try
4	`src/tx_broadcaster.rs` 50 → 500	STRONGLY RECOMMEND KEEP	Silent broadcast loss under sweep load

None are cosmetic. Details below.

1. `src/lib.rs` — `sync_wallets()` reuses node runtime — MUST KEEP

What changes

Upstream:

tokio::task::block_in_place(move || {
    tokio::runtime::Builder::new_multi_thread().enable_all().build().unwrap()
        .block_on(async move { /* chain-source sync */ })
})

Patched:

let runtime = rt_lock.as_ref().ok_or(Error::NotRunning)?;
runtime.block_on(async move { /* chain-source sync */ })

Why upstream is wrong for us specifically

Node::sync_wallets() is exposed as the capability core.lightning.sync_wallets from the ldk-node builtin app (cdylib):

Capability registration: modules/ldk-node/src/capabilities.rs:100, :218
Handler: modules/ldk-node/src/capabilities.rs:1010 → node.sync_wallets()

Capability dispatch in builtin apps runs on the FFI handler thread, which is not a tokio worker thread. tokio::task::block_in_place panics in that environment because it requires a current-thread tokio runtime context. The panic is then silently eaten at the extern "C" cdylib boundary (panics across FFI are UB / aborted-and-swallowed depending on panic = "unwind" config).

Concrete callers that break

All routed through apps/server/src/services/onboarding/lightning_client.rs:443 → invoke("core.lightning.sync_wallets", …):

Caller (file:line)	Failure mode if not patched
`services/onboarding/sweep_handler.rs:381,520,553,1127,1257`	Onboarding sweep retry loop never returns; user funds stuck pending in temp-node wallet
`services/onboarding/orchestration_service.rs:572,1128`	`wait_until_confirmed`-style polls hang; orchestration never advances past “waiting for chain sync”
`services/onboarding/inbound_liquidity_handler.rs:1082`	Buyer-side liquidity request never refreshes balance after seller channel opens
`services/onboarding/channel_setup_handler.rs:393`	Channel-ready poll loop hangs
`services/liquidity_service.rs:102`	LSP `ready` poll hangs
`apps/lps/src/api.rs:535` (`/api/v1/node/sync` HTTP)	LPS HTTP endpoint times out

What you actually see if you drop the patch

Symptom: onboarding sits at "Syncing wallets…" indefinitely. No log line; no error returned. Health probe still healthy.
Repro: run E2E onboarding suite (tests/e2e/...sweep) on a regtest with non-empty mempool. First sync_wallets invocation never resolves.
The corroborating in-repo write-up of the same bug is at apps/server/src/services/onboarding/orchestration_service.rs:198–211.

Why we can't fix it downstream

The runtime construction lives inside a pub method body. There is no extension point on Node, no trait we can override, and no way to inject a runtime. The only fix is to edit the method body.

Upstream-ability

High. Replacing block_in_place + new_multi_thread + block_on with runtime.block_on is strictly more correct (it removes a hidden invariant that the caller already be on a tokio thread). This should be PR'd to LDK.

2. `src/payment/bolt11.rs` — `receive_with_hints` / `receive_for_hash_with_hints` — MUST KEEP

What changes

Adds three new methods on Bolt11Payment:

receive_with_hints(amount, desc, expiry, route_hints)
receive_for_hash_with_hints(amount, desc, expiry, payment_hash, route_hints) (HODL variant)
receive_with_hints_inner(...) (shared impl)

They use InvoiceBuilder directly + keys_manager.get_node_secret_key() for ECDSA-recoverable signing. (This is also why the diff threads Arc<KeysManager> through Bolt11Payment::new in src/lib.rs.)

Why upstream is wrong for us specifically

ChannelManager::create_bolt11_invoice runs candidate inbound channels through sort_and_filter_channels, which drops every channel where forwarding_info == None — i.e. channels whose counterparty has not sent a unicast channel_update to us yet.

In a leaf-LSP topology this is the steady state: LSPs deliberately keep channels private and never broadcast channel_update. Result: LDK returns an invoice with route_hints = []. Any payment that is not single-hop direct (i.e. essentially every real payment) fails to find a route.

Concrete callers that break

modules/ldk-node/src/capabilities.rs:1136 — capability core.lightning.create_invoice_with_hints (regular invoices when this node sits behind an LSP)
modules/ldk-node/src/capabilities.rs:1183 — capability core.lightning.create_hodl_invoice_with_hints — the seller-side HODL invoice for the inbound-liquidity marketplace (apps/server/src/services/liquidity_service.rs:237–243 documents the exact reason: gossip hasn't propagated when the buyer's temp-node sweep runs, so an invoice without manual hints is unroutable)
Hint construction: modules/ldk-node/src/capabilities.rs:1068–1110 (build_manual_route_hints walks node.list_channels() and builds RouteHintHop entries with inbound_htlc_minimum_msat/inbound_htlc_maximum_msat)

What you actually see if you drop the patch

Symptom: created invoices have route_hints = 0 (visible via lightning-cli decodepay). Sender returns RouteNotFound. UI shows "no route to destination".
Affected production flows:
1. Inbound-liquidity marketplace (seller) — every HODL invoice we mint is unroutable; buyer never pays; we never open the channel; user never receives liquidity.
2. Any normal receive while behind an LSP — every receive endpoint silently produces invoices that nobody can pay.
Both flows currently call core.lightning.{create_invoice_with_hints,create_hodl_invoice_with_hints}. Without the new ldk-node methods, those capabilities can't be implemented at all and the modules/ldk-node builtin will fail to compile (it directly invokes node.bolt11_payment().receive_with_hints(...)).

Why we can't fix it downstream

create_bolt11_invoice is the only public Bolt11 builder LDK exposes, the hint filter is unconditional, and Bolt11Payment::receive_inner is pub(crate). The only escape is to bypass it with InvoiceBuilder + node secret, which requires access to KeysManager — a private field on Node. So the new methods have to live in the fork.

Upstream-ability

Medium. Could be sent upstream as Bolt11Payment::receive_with_route_hints(...). Likely contentious because it lets callers ship invoices with hints that don't reflect actual forwarding policy — but the LSP-leaf case is real and worth raising.

3. `src/config.rs` — `max_inbound_htlc_value_in_flight_percent_of_channel = 100` — MUST KEEP

What changes

In default_user_config():

user_config.channel_handshake_config.max_inbound_htlc_value_in_flight_percent_of_channel = 100;

LDK default = 10.

Why upstream is wrong for us specifically

This setting is negotiated at channel open and baked into the channel forever. It caps any single in-flight forwarded HTLC at N% of channel capacity; over-cap forwards return temporary_channel_failure.

The onboarding sweep flow (apps/server/src/services/onboarding/sweep_handler.rs) opens one temp-node ↔ LSP channel and then deliberately pushes the user's sweep — empirically ~70% of channel capacity — through that single channel as one HTLC. With LDK's default 10%, the first HTLC is rejected and the sweep fails before it can split.

LDK upstream already overrides this knob to 100% in Builder::build_with_store when LSPS2 client/service is enabled. This patch just generalises that override to all channels — which is what we need because the sweep doesn't go through LSPS2.

Concrete callers / paths that break

Sweep HTLC construction in services/onboarding/sweep_handler.rs:664 (bolt11_payment().send())
Any large single-HTLC forward through a channel opened by this node — including L402-paid bursts and AI-agent-driven workflows

What you actually see if you drop the patch

Symptom: sweep returns temporary_channel_failure; LDK retries don't help because the cap is per-channel.
Workaround: split the sweep into ≤10% chunks (~10 sequential HTLCs). That's much slower and exposes the user to partial-failure recovery scenarios.
Concretely, services/onboarding/sweep_handler.rs does not implement chunked retry for this case, so onboarding fails outright.

Why we can't fix it downstream

default_user_config constructs the UserConfig baked into channels at open time. There is no public mutator on a built ChannelManager to change per-channel HTLC caps after open. We'd have to wrap channel opening end-to-end with our own UserConfig — which means duplicating Builder::build_with_store logic and the LSPS2 overrides.

Upstream-ability

Easy. This is a one-line default change; LDK already does it for the LSPS2 path. The only friction is the upstream maintainers may want it gated on a config flag rather than unconditional.

4. `src/tx_broadcaster.rs` — `BCAST_PACKAGE_QUEUE_SIZE: 50 → 500` — STRONGLY RECOMMEND KEEP

What changes

The internal mpsc buffer holding broadcast packages grows from 50 → 500 entries.

Why upstream is wrong for us specifically

LDK's onchain claim/bump task re-broadcasts force-close commitment-txs on every block. Bitcoind rejects them with Transaction outputs already in utxo set, but LDK keeps retrying. The broadcaster uses try_send (non-blocking); when the queue is full, packages are silently dropped with SendError::Full.

If a previous regtest run, integration test, or even a real production force-close left a stuck commitment-tx in the queue, the queue can hit 50 within a few blocks. Subsequent one-shot broadcasts (sweep tx, funding tx, channel update commitments) get dropped without any error path to the caller.

Concrete callers / paths that break

services/onboarding/sweep_handler.rs funding-tx broadcast — onboarding silently stalls; user sees "waiting for confirmation" but the tx was never broadcast.
Anyone running tests against a fresh regtest reuse — cargo test or E2E reruns intermittently fail with no useful log because the dropped package never even reaches the broadcaster's logging path.
Production: any node with a stuck force-close (common after wallet restore or peer downtime) becomes unable to broadcast new sweeps.

What you actually see if you drop the patch

Symptom: nothing in logs. Sweep / funding tx never appears in mempool. Frontend shows perpetual "broadcasting…".
Repro: open + force-close a channel against an unreachable peer, then attempt onboarding from same node. The onboarding funding tx randomly never lands.

Why we can't fix it downstream

BCAST_PACKAGE_QUEUE_SIZE is a const, the channel and capacity are set in TransactionBroadcaster::new, and TransactionBroadcaster is pub(crate). Zero downstream entry points.

Risk of keeping it

500 packages × ~250 bytes/tx ≈ 125 KiB extra worst-case RAM. Negligible.

Upstream-ability

Easy if framed correctly. The real upstream fix is "prefer blocking-send / log-and-coalesce on full" rather than "make the queue bigger" — try_send silently dropping is the underlying bug. But bumping the constant is a strict upgrade and a reasonable interim fix to PR.

lamtuanvu · 2026-05-02T04:31:19Z

+// enough that legitimate sweep/funding broadcasts always make it through
+// even when an old monitor's commitment-tx is stuck looping against
+// bitcoind's "Transaction outputs already in utxo set" rejection.
+const BCAST_PACKAGE_QUEUE_SIZE: usize = 500;


I'm a bit hesitate with directly change the default constant that might conflict with original purpose of that value. if possible, why don't we make it as a configurable value that default value can be override by the consumer?

lamtuanvu · 2026-05-02T04:42:50Z

+	// cap to 100% for every channel this node negotiates (both as opener
+	// and as accepting peer). This matches the override already applied
+	// when we are an LSPS2 client/service (see `Builder::build_with_store`).
+	user_config.channel_handshake_config.max_inbound_htlc_value_in_flight_percent_of_channel = 100;


I prefer keeping the default value but having it a capability to override is better to not breaking other behaviors? I wonder if the 10% capacity is on purpose for some reason?

…sion

The ChannelFunding confirmation target resolved to a 12-block fee tier, which during normal mempool congestion maps to a rate low enough that funding transactions can sit unconfirmed for hours. The channel never reaches channel_ready and the UI is stuck on `sync`. Lower the target to ~3 blocks (mempool's "fast" tier) so funding txs are mined promptly. The fee is still sourced from the chain source's recommended estimates (esplora get_fee_estimates / electrum estimatefee / bitcoind estimatesmartfee) — only the targeted confirmation window changes, and only for funding transactions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lamtuanvu and others added 4 commits July 3, 2025 21:48

added the ability to customize the gossip data

d57c1a8

Merge branch 'main' into custom-gossip-data

1932d8c

added sync events action

000a330

fix: improve to suitable with onboarding flow

8e1599b

lamtuanvu requested changes May 2, 2026

View reviewed changes

datphamcode295 and others added 2 commits May 8, 2026 23:59

fix: opt in to forwarding HTLCs over private channelsupport paris ver…

a6e8231

…sion

lamtuanvu force-pushed the custom-gossip-data branch from 000a330 to 4114c6c Compare June 10, 2026 18:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve to suitable with onboarding flow#2

fix: improve to suitable with onboarding flow#2
datphamcode295 wants to merge 6 commits into
lamtuanvu:custom-gossip-datafrom
datphamcode295:fix/improve-to-fit-with-onboarding-flow

datphamcode295 commented Apr 29, 2026 •

edited

Loading

Uh oh!

lamtuanvu May 2, 2026

Uh oh!

lamtuanvu May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

datphamcode295 commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

LDK-Node Fork Modifications

Context

1. src/lib.rs — sync_wallets() reuses node runtime — MUST KEEP

What changes

Why upstream is wrong for us specifically

Concrete callers that break

What you actually see if you drop the patch

Why we can't fix it downstream

Upstream-ability

2. src/payment/bolt11.rs — receive_with_hints / receive_for_hash_with_hints — MUST KEEP

What changes

Why upstream is wrong for us specifically

Concrete callers that break

What you actually see if you drop the patch

Why we can't fix it downstream

Upstream-ability

3. src/config.rs — max_inbound_htlc_value_in_flight_percent_of_channel = 100 — MUST KEEP

What changes

Why upstream is wrong for us specifically

Concrete callers / paths that break

What you actually see if you drop the patch

Why we can't fix it downstream

Upstream-ability

4. src/tx_broadcaster.rs — BCAST_PACKAGE_QUEUE_SIZE: 50 → 500 — STRONGLY RECOMMEND KEEP

What changes

Why upstream is wrong for us specifically

Concrete callers / paths that break

What you actually see if you drop the patch

Why we can't fix it downstream

Risk of keeping it

Upstream-ability

Uh oh!

lamtuanvu May 2, 2026

Choose a reason for hiding this comment

Uh oh!

lamtuanvu May 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

datphamcode295 commented Apr 29, 2026 •

edited

Loading

1. `src/lib.rs` — `sync_wallets()` reuses node runtime — MUST KEEP

2. `src/payment/bolt11.rs` — `receive_with_hints` / `receive_for_hash_with_hints` — MUST KEEP

3. `src/config.rs` — `max_inbound_htlc_value_in_flight_percent_of_channel = 100` — MUST KEEP

4. `src/tx_broadcaster.rs` — `BCAST_PACKAGE_QUEUE_SIZE: 50 → 500` — STRONGLY RECOMMEND KEEP