Skip to content

fix: improve to suitable with onboarding flow#2

Open
datphamcode295 wants to merge 6 commits into
lamtuanvu:custom-gossip-datafrom
datphamcode295:fix/improve-to-fit-with-onboarding-flow
Open

fix: improve to suitable with onboarding flow#2
datphamcode295 wants to merge 6 commits into
lamtuanvu:custom-gossip-datafrom
datphamcode295:fix/improve-to-fit-with-onboarding-flow

Conversation

@datphamcode295

@datphamcode295 datphamcode295 commented Apr 29, 2026

Copy link
Copy Markdown

LDK-Node Fork Modifications

Context

The vendored ./ldk-node checkout (pinned to lamtuanvu/ldk-node@000a3309, branch custom-gossip-data) carries 4 in-tree modifications on top of the pinned commit. They are produced deterministically by infra/scripts/setup-ldk-node.sh, which after cloning runs:

for patch in patches/ldk-node-*.patch; do
  git apply "$patch"
done

Today there is exactly one such file: patches/ldk-node-tx-broadcaster-queue-bump.patch (357 lines), and it contains all 4 modifications — the filename is misleading because it only mentions the queue bump.

So: the working-tree diff is already canonicalized as a patch. Reverse-applying it succeeds (git apply --reverse --check → 0). The question is per modification, can it be dropped without breaking production code paths in this repo?

Verdict matrix:

# File Status Drop = ?
1 src/lib.rs sync_wallets rewrite MUST KEEP Onboarding hangs forever
2 src/payment/bolt11.rs *_with_hints MUST KEEP Inbound payments unroutable
3 src/config.rs 10% → 100% MUST KEEP Sweep HTLC fails on first try
4 src/tx_broadcaster.rs 50 → 500 STRONGLY RECOMMEND KEEP Silent broadcast loss under sweep load

None are cosmetic. Details below.


1. src/lib.rssync_wallets() reuses node runtime — MUST KEEP

What changes

Upstream:

tokio::task::block_in_place(move || {
    tokio::runtime::Builder::new_multi_thread().enable_all().build().unwrap()
        .block_on(async move { /* chain-source sync */ })
})

Patched:

let runtime = rt_lock.as_ref().ok_or(Error::NotRunning)?;
runtime.block_on(async move { /* chain-source sync */ })

Why upstream is wrong for us specifically

Node::sync_wallets() is exposed as the capability core.lightning.sync_wallets from the ldk-node builtin app (cdylib):

  • Capability registration: modules/ldk-node/src/capabilities.rs:100, :218
  • Handler: modules/ldk-node/src/capabilities.rs:1010node.sync_wallets()

Capability dispatch in builtin apps runs on the FFI handler thread, which is not a tokio worker thread. tokio::task::block_in_place panics in that environment because it requires a current-thread tokio runtime context. The panic is then silently eaten at the extern "C" cdylib boundary (panics across FFI are UB / aborted-and-swallowed depending on panic = "unwind" config).

Concrete callers that break

All routed through apps/server/src/services/onboarding/lightning_client.rs:443invoke("core.lightning.sync_wallets", …):

Caller (file:line) Failure mode if not patched
services/onboarding/sweep_handler.rs:381,520,553,1127,1257 Onboarding sweep retry loop never returns; user funds stuck pending in temp-node wallet
services/onboarding/orchestration_service.rs:572,1128 wait_until_confirmed-style polls hang; orchestration never advances past “waiting for chain sync”
services/onboarding/inbound_liquidity_handler.rs:1082 Buyer-side liquidity request never refreshes balance after seller channel opens
services/onboarding/channel_setup_handler.rs:393 Channel-ready poll loop hangs
services/liquidity_service.rs:102 LSP ready poll hangs
apps/lps/src/api.rs:535 (/api/v1/node/sync HTTP) LPS HTTP endpoint times out

What you actually see if you drop the patch

  • Symptom: onboarding sits at "Syncing wallets…" indefinitely. No log line; no error returned. Health probe still healthy.
  • Repro: run E2E onboarding suite (tests/e2e/...sweep) on a regtest with non-empty mempool. First sync_wallets invocation never resolves.
  • The corroborating in-repo write-up of the same bug is at apps/server/src/services/onboarding/orchestration_service.rs:198–211.

Why we can't fix it downstream

The runtime construction lives inside a pub method body. There is no extension point on Node, no trait we can override, and no way to inject a runtime. The only fix is to edit the method body.

Upstream-ability

High. Replacing block_in_place + new_multi_thread + block_on with runtime.block_on is strictly more correct (it removes a hidden invariant that the caller already be on a tokio thread). This should be PR'd to LDK.


2. src/payment/bolt11.rsreceive_with_hints / receive_for_hash_with_hintsMUST KEEP

What changes

Adds three new methods on Bolt11Payment:

  • receive_with_hints(amount, desc, expiry, route_hints)
  • receive_for_hash_with_hints(amount, desc, expiry, payment_hash, route_hints) (HODL variant)
  • receive_with_hints_inner(...) (shared impl)

They use InvoiceBuilder directly + keys_manager.get_node_secret_key() for ECDSA-recoverable signing. (This is also why the diff threads Arc<KeysManager> through Bolt11Payment::new in src/lib.rs.)

Why upstream is wrong for us specifically

ChannelManager::create_bolt11_invoice runs candidate inbound channels through sort_and_filter_channels, which drops every channel where forwarding_info == None — i.e. channels whose counterparty has not sent a unicast channel_update to us yet.

In a leaf-LSP topology this is the steady state: LSPs deliberately keep channels private and never broadcast channel_update. Result: LDK returns an invoice with route_hints = []. Any payment that is not single-hop direct (i.e. essentially every real payment) fails to find a route.

Concrete callers that break

  • modules/ldk-node/src/capabilities.rs:1136 — capability core.lightning.create_invoice_with_hints (regular invoices when this node sits behind an LSP)
  • modules/ldk-node/src/capabilities.rs:1183 — capability core.lightning.create_hodl_invoice_with_hints — the seller-side HODL invoice for the inbound-liquidity marketplace (apps/server/src/services/liquidity_service.rs:237–243 documents the exact reason: gossip hasn't propagated when the buyer's temp-node sweep runs, so an invoice without manual hints is unroutable)
  • Hint construction: modules/ldk-node/src/capabilities.rs:1068–1110 (build_manual_route_hints walks node.list_channels() and builds RouteHintHop entries with inbound_htlc_minimum_msat/inbound_htlc_maximum_msat)

What you actually see if you drop the patch

  • Symptom: created invoices have route_hints = 0 (visible via lightning-cli decodepay). Sender returns RouteNotFound. UI shows "no route to destination".
  • Affected production flows:
    1. Inbound-liquidity marketplace (seller) — every HODL invoice we mint is unroutable; buyer never pays; we never open the channel; user never receives liquidity.
    2. Any normal receive while behind an LSP — every receive endpoint silently produces invoices that nobody can pay.
  • Both flows currently call core.lightning.{create_invoice_with_hints,create_hodl_invoice_with_hints}. Without the new ldk-node methods, those capabilities can't be implemented at all and the modules/ldk-node builtin will fail to compile (it directly invokes node.bolt11_payment().receive_with_hints(...)).

Why we can't fix it downstream

create_bolt11_invoice is the only public Bolt11 builder LDK exposes, the hint filter is unconditional, and Bolt11Payment::receive_inner is pub(crate). The only escape is to bypass it with InvoiceBuilder + node secret, which requires access to KeysManager — a private field on Node. So the new methods have to live in the fork.

Upstream-ability

Medium. Could be sent upstream as Bolt11Payment::receive_with_route_hints(...). Likely contentious because it lets callers ship invoices with hints that don't reflect actual forwarding policy — but the LSP-leaf case is real and worth raising.


3. src/config.rsmax_inbound_htlc_value_in_flight_percent_of_channel = 100MUST KEEP

What changes

In default_user_config():

user_config.channel_handshake_config.max_inbound_htlc_value_in_flight_percent_of_channel = 100;

LDK default = 10.

Why upstream is wrong for us specifically

This setting is negotiated at channel open and baked into the channel forever. It caps any single in-flight forwarded HTLC at N% of channel capacity; over-cap forwards return temporary_channel_failure.

The onboarding sweep flow (apps/server/src/services/onboarding/sweep_handler.rs) opens one temp-node ↔ LSP channel and then deliberately pushes the user's sweep — empirically ~70% of channel capacity — through that single channel as one HTLC. With LDK's default 10%, the first HTLC is rejected and the sweep fails before it can split.

LDK upstream already overrides this knob to 100% in Builder::build_with_store when LSPS2 client/service is enabled. This patch just generalises that override to all channels — which is what we need because the sweep doesn't go through LSPS2.

Concrete callers / paths that break

  • Sweep HTLC construction in services/onboarding/sweep_handler.rs:664 (bolt11_payment().send())
  • Any large single-HTLC forward through a channel opened by this node — including L402-paid bursts and AI-agent-driven workflows

What you actually see if you drop the patch

  • Symptom: sweep returns temporary_channel_failure; LDK retries don't help because the cap is per-channel.
  • Workaround: split the sweep into ≤10% chunks (~10 sequential HTLCs). That's much slower and exposes the user to partial-failure recovery scenarios.
  • Concretely, services/onboarding/sweep_handler.rs does not implement chunked retry for this case, so onboarding fails outright.

Why we can't fix it downstream

default_user_config constructs the UserConfig baked into channels at open time. There is no public mutator on a built ChannelManager to change per-channel HTLC caps after open. We'd have to wrap channel opening end-to-end with our own UserConfig — which means duplicating Builder::build_with_store logic and the LSPS2 overrides.

Upstream-ability

Easy. This is a one-line default change; LDK already does it for the LSPS2 path. The only friction is the upstream maintainers may want it gated on a config flag rather than unconditional.


4. src/tx_broadcaster.rsBCAST_PACKAGE_QUEUE_SIZE: 50 → 500STRONGLY RECOMMEND KEEP

What changes

The internal mpsc buffer holding broadcast packages grows from 50 → 500 entries.

Why upstream is wrong for us specifically

LDK's onchain claim/bump task re-broadcasts force-close commitment-txs on every block. Bitcoind rejects them with Transaction outputs already in utxo set, but LDK keeps retrying. The broadcaster uses try_send (non-blocking); when the queue is full, packages are silently dropped with SendError::Full.

If a previous regtest run, integration test, or even a real production force-close left a stuck commitment-tx in the queue, the queue can hit 50 within a few blocks. Subsequent one-shot broadcasts (sweep tx, funding tx, channel update commitments) get dropped without any error path to the caller.

Concrete callers / paths that break

  • services/onboarding/sweep_handler.rs funding-tx broadcast — onboarding silently stalls; user sees "waiting for confirmation" but the tx was never broadcast.
  • Anyone running tests against a fresh regtest reuse — cargo test or E2E reruns intermittently fail with no useful log because the dropped package never even reaches the broadcaster's logging path.
  • Production: any node with a stuck force-close (common after wallet restore or peer downtime) becomes unable to broadcast new sweeps.

What you actually see if you drop the patch

  • Symptom: nothing in logs. Sweep / funding tx never appears in mempool. Frontend shows perpetual "broadcasting…".
  • Repro: open + force-close a channel against an unreachable peer, then attempt onboarding from same node. The onboarding funding tx randomly never lands.

Why we can't fix it downstream

BCAST_PACKAGE_QUEUE_SIZE is a const, the channel and capacity are set in TransactionBroadcaster::new, and TransactionBroadcaster is pub(crate). Zero downstream entry points.

Risk of keeping it

500 packages × ~250 bytes/tx ≈ 125 KiB extra worst-case RAM. Negligible.

Upstream-ability

Easy if framed correctly. The real upstream fix is "prefer blocking-send / log-and-coalesce on full" rather than "make the queue bigger" — try_send silently dropping is the underlying bug. But bumping the constant is a strict upgrade and a reasonable interim fix to PR.

Comment thread src/tx_broadcaster.rs
// enough that legitimate sweep/funding broadcasts always make it through
// even when an old monitor's commitment-tx is stuck looping against
// bitcoind's "Transaction outputs already in utxo set" rejection.
const BCAST_PACKAGE_QUEUE_SIZE: usize = 500;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit hesitate with directly change the default constant that might conflict with original purpose of that value. if possible, why don't we make it as a configurable value that default value can be override by the consumer?

Comment thread src/config.rs
// cap to 100% for every channel this node negotiates (both as opener
// and as accepting peer). This matches the override already applied
// when we are an LSPS2 client/service (see `Builder::build_with_store`).
user_config.channel_handshake_config.max_inbound_htlc_value_in_flight_percent_of_channel = 100;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer keeping the default value but having it a capability to override is better to not breaking other behaviors? I wonder if the 10% capacity is on purpose for some reason?

datphamcode295 and others added 2 commits May 8, 2026 23:59
The ChannelFunding confirmation target resolved to a 12-block fee tier,
which during normal mempool congestion maps to a rate low enough that
funding transactions can sit unconfirmed for hours. The channel never
reaches channel_ready and the UI is stuck on `sync`.

Lower the target to ~3 blocks (mempool's "fast" tier) so funding txs are
mined promptly. The fee is still sourced from the chain source's
recommended estimates (esplora get_fee_estimates / electrum estimatefee /
bitcoind estimatesmartfee) — only the targeted confirmation window
changes, and only for funding transactions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lamtuanvu lamtuanvu force-pushed the custom-gossip-data branch from 000a330 to 4114c6c Compare June 10, 2026 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants