fix: improve to suitable with onboarding flow#2
Open
datphamcode295 wants to merge 6 commits into
Open
Conversation
lamtuanvu
requested changes
May 2, 2026
| // enough that legitimate sweep/funding broadcasts always make it through | ||
| // even when an old monitor's commitment-tx is stuck looping against | ||
| // bitcoind's "Transaction outputs already in utxo set" rejection. | ||
| const BCAST_PACKAGE_QUEUE_SIZE: usize = 500; |
Owner
There was a problem hiding this comment.
I'm a bit hesitate with directly change the default constant that might conflict with original purpose of that value. if possible, why don't we make it as a configurable value that default value can be override by the consumer?
| // cap to 100% for every channel this node negotiates (both as opener | ||
| // and as accepting peer). This matches the override already applied | ||
| // when we are an LSPS2 client/service (see `Builder::build_with_store`). | ||
| user_config.channel_handshake_config.max_inbound_htlc_value_in_flight_percent_of_channel = 100; |
Owner
There was a problem hiding this comment.
I prefer keeping the default value but having it a capability to override is better to not breaking other behaviors? I wonder if the 10% capacity is on purpose for some reason?
The ChannelFunding confirmation target resolved to a 12-block fee tier, which during normal mempool congestion maps to a rate low enough that funding transactions can sit unconfirmed for hours. The channel never reaches channel_ready and the UI is stuck on `sync`. Lower the target to ~3 blocks (mempool's "fast" tier) so funding txs are mined promptly. The fee is still sourced from the chain source's recommended estimates (esplora get_fee_estimates / electrum estimatefee / bitcoind estimatesmartfee) — only the targeted confirmation window changes, and only for funding transactions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
000a330 to
4114c6c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
LDK-Node Fork Modifications
Context
The vendored
./ldk-nodecheckout (pinned tolamtuanvu/ldk-node@000a3309, branchcustom-gossip-data) carries 4 in-tree modifications on top of the pinned commit. They are produced deterministically byinfra/scripts/setup-ldk-node.sh, which after cloning runs:Today there is exactly one such file:
patches/ldk-node-tx-broadcaster-queue-bump.patch(357 lines), and it contains all 4 modifications — the filename is misleading because it only mentions the queue bump.So: the working-tree diff is already canonicalized as a patch. Reverse-applying it succeeds (
git apply --reverse --check → 0). The question is per modification, can it be dropped without breaking production code paths in this repo?Verdict matrix:
src/lib.rssync_walletsrewritesrc/payment/bolt11.rs*_with_hintssrc/config.rs10% → 100%src/tx_broadcaster.rs50 → 500None are cosmetic. Details below.
1.
src/lib.rs—sync_wallets()reuses node runtime — MUST KEEPWhat changes
Upstream:
Patched:
Why upstream is wrong for us specifically
Node::sync_wallets()is exposed as the capabilitycore.lightning.sync_walletsfrom the ldk-node builtin app (cdylib):modules/ldk-node/src/capabilities.rs:100,:218modules/ldk-node/src/capabilities.rs:1010→node.sync_wallets()Capability dispatch in builtin apps runs on the FFI handler thread, which is not a tokio worker thread.
tokio::task::block_in_placepanics in that environment because it requires a current-thread tokio runtime context. The panic is then silently eaten at theextern "C"cdylib boundary (panics across FFI are UB / aborted-and-swallowed depending onpanic = "unwind"config).Concrete callers that break
All routed through
apps/server/src/services/onboarding/lightning_client.rs:443→invoke("core.lightning.sync_wallets", …):services/onboarding/sweep_handler.rs:381,520,553,1127,1257services/onboarding/orchestration_service.rs:572,1128wait_until_confirmed-style polls hang; orchestration never advances past “waiting for chain sync”services/onboarding/inbound_liquidity_handler.rs:1082services/onboarding/channel_setup_handler.rs:393services/liquidity_service.rs:102readypoll hangsapps/lps/src/api.rs:535(/api/v1/node/syncHTTP)What you actually see if you drop the patch
tests/e2e/...sweep) on a regtest with non-empty mempool. Firstsync_walletsinvocation never resolves.apps/server/src/services/onboarding/orchestration_service.rs:198–211.Why we can't fix it downstream
The runtime construction lives inside a
pubmethod body. There is no extension point onNode, no trait we can override, and no way to inject a runtime. The only fix is to edit the method body.Upstream-ability
High. Replacing
block_in_place + new_multi_thread + block_onwithruntime.block_onis strictly more correct (it removes a hidden invariant that the caller already be on a tokio thread). This should be PR'd to LDK.2.
src/payment/bolt11.rs—receive_with_hints/receive_for_hash_with_hints— MUST KEEPWhat changes
Adds three new methods on
Bolt11Payment:receive_with_hints(amount, desc, expiry, route_hints)receive_for_hash_with_hints(amount, desc, expiry, payment_hash, route_hints)(HODL variant)receive_with_hints_inner(...)(shared impl)They use
InvoiceBuilderdirectly +keys_manager.get_node_secret_key()for ECDSA-recoverable signing. (This is also why the diff threadsArc<KeysManager>throughBolt11Payment::newinsrc/lib.rs.)Why upstream is wrong for us specifically
ChannelManager::create_bolt11_invoiceruns candidate inbound channels throughsort_and_filter_channels, which drops every channel whereforwarding_info == None— i.e. channels whose counterparty has not sent a unicastchannel_updateto us yet.In a leaf-LSP topology this is the steady state: LSPs deliberately keep channels private and never broadcast
channel_update. Result: LDK returns an invoice withroute_hints = []. Any payment that is not single-hop direct (i.e. essentially every real payment) fails to find a route.Concrete callers that break
modules/ldk-node/src/capabilities.rs:1136— capabilitycore.lightning.create_invoice_with_hints(regular invoices when this node sits behind an LSP)modules/ldk-node/src/capabilities.rs:1183— capabilitycore.lightning.create_hodl_invoice_with_hints— the seller-side HODL invoice for the inbound-liquidity marketplace (apps/server/src/services/liquidity_service.rs:237–243documents the exact reason: gossip hasn't propagated when the buyer's temp-node sweep runs, so an invoice without manual hints is unroutable)modules/ldk-node/src/capabilities.rs:1068–1110(build_manual_route_hintswalksnode.list_channels()and buildsRouteHintHopentries withinbound_htlc_minimum_msat/inbound_htlc_maximum_msat)What you actually see if you drop the patch
route_hints = 0(visible vialightning-cli decodepay). Sender returnsRouteNotFound. UI shows "no route to destination".receivewhile behind an LSP — every receive endpoint silently produces invoices that nobody can pay.core.lightning.{create_invoice_with_hints,create_hodl_invoice_with_hints}. Without the new ldk-node methods, those capabilities can't be implemented at all and the modules/ldk-node builtin will fail to compile (it directly invokesnode.bolt11_payment().receive_with_hints(...)).Why we can't fix it downstream
create_bolt11_invoiceis the only public Bolt11 builder LDK exposes, the hint filter is unconditional, andBolt11Payment::receive_innerispub(crate). The only escape is to bypass it withInvoiceBuilder+ node secret, which requires access toKeysManager— a private field onNode. So the new methods have to live in the fork.Upstream-ability
Medium. Could be sent upstream as
Bolt11Payment::receive_with_route_hints(...). Likely contentious because it lets callers ship invoices with hints that don't reflect actual forwarding policy — but the LSP-leaf case is real and worth raising.3.
src/config.rs—max_inbound_htlc_value_in_flight_percent_of_channel = 100— MUST KEEPWhat changes
In
default_user_config():LDK default = 10.
Why upstream is wrong for us specifically
This setting is negotiated at channel open and baked into the channel forever. It caps any single in-flight forwarded HTLC at N% of channel capacity; over-cap forwards return
temporary_channel_failure.The onboarding sweep flow (
apps/server/src/services/onboarding/sweep_handler.rs) opens one temp-node ↔ LSP channel and then deliberately pushes the user's sweep — empirically ~70% of channel capacity — through that single channel as one HTLC. With LDK's default 10%, the first HTLC is rejected and the sweep fails before it can split.LDK upstream already overrides this knob to 100% in
Builder::build_with_storewhen LSPS2 client/service is enabled. This patch just generalises that override to all channels — which is what we need because the sweep doesn't go through LSPS2.Concrete callers / paths that break
services/onboarding/sweep_handler.rs:664(bolt11_payment().send())What you actually see if you drop the patch
temporary_channel_failure; LDK retries don't help because the cap is per-channel.services/onboarding/sweep_handler.rsdoes not implement chunked retry for this case, so onboarding fails outright.Why we can't fix it downstream
default_user_configconstructs theUserConfigbaked into channels at open time. There is no public mutator on a builtChannelManagerto change per-channel HTLC caps after open. We'd have to wrap channel opening end-to-end with our ownUserConfig— which means duplicatingBuilder::build_with_storelogic and the LSPS2 overrides.Upstream-ability
Easy. This is a one-line default change; LDK already does it for the LSPS2 path. The only friction is the upstream maintainers may want it gated on a config flag rather than unconditional.
4.
src/tx_broadcaster.rs—BCAST_PACKAGE_QUEUE_SIZE: 50 → 500— STRONGLY RECOMMEND KEEPWhat changes
The internal
mpscbuffer holding broadcast packages grows from 50 → 500 entries.Why upstream is wrong for us specifically
LDK's onchain claim/bump task re-broadcasts force-close commitment-txs on every block. Bitcoind rejects them with
Transaction outputs already in utxo set, but LDK keeps retrying. The broadcaster usestry_send(non-blocking); when the queue is full, packages are silently dropped withSendError::Full.If a previous regtest run, integration test, or even a real production force-close left a stuck commitment-tx in the queue, the queue can hit 50 within a few blocks. Subsequent one-shot broadcasts (sweep tx, funding tx, channel update commitments) get dropped without any error path to the caller.
Concrete callers / paths that break
services/onboarding/sweep_handler.rsfunding-tx broadcast — onboarding silently stalls; user sees "waiting for confirmation" but the tx was never broadcast.cargo testor E2E reruns intermittently fail with no useful log because the dropped package never even reaches the broadcaster's logging path.What you actually see if you drop the patch
Why we can't fix it downstream
BCAST_PACKAGE_QUEUE_SIZEis aconst, the channel and capacity are set inTransactionBroadcaster::new, andTransactionBroadcasterispub(crate). Zero downstream entry points.Risk of keeping it
500 packages × ~250 bytes/tx ≈ 125 KiB extra worst-case RAM. Negligible.
Upstream-ability
Easy if framed correctly. The real upstream fix is "prefer blocking-send / log-and-coalesce on full" rather than "make the queue bigger" —
try_sendsilently dropping is the underlying bug. But bumping the constant is a strict upgrade and a reasonable interim fix to PR.