Bug report written mostly by Opus 4.7 on my instructions (and Opus did most of the root cause finding)
Title
Parallel dual simplex (PAMI) stalls indefinitely after post-optimality residual rescan; serial EKK resolves the same problem in one extra pivot
Environment
HiGHS: 1.11.0, 1.12.0, 1.14.0 — all three versions reproduce identically
Binding: reproduced via external binary (bin/highs_v1.11, bin/highs_v1.12) and highspy==1.14.0 (Python)
OS: Linux (6.14.0), x86_64, 4 physical cores available (symptom is worst with threads < 8)
Model: ~32 MB free-format MPS (5 MB gzipped) — 207,522 rows × 184,608 cols, 704,825 non-zeros. Pure LP (no integer variables in this reduction). From an energy-system optimization model (FlexTool, rivendell hydro-cascade scenario). Structure is typical of power-system dispatch: mixed demand-balance, min-up/min-down commitment logic, hourly timesteps.
Summary
With parallel=on (default when threads > 1), dual simplex reaches Pr=0, Du=0 on the first pass. A post-optimality quality rescan then reveals a small residual infeasibility (e.g. Pr: 1(1.07e-05)). HiGHS responds by bumping the Markowitz threshold to 0.5 and attempting to clear the residual via further pivots — but the simplex iteration count freezes at the same value and the residual never changes. The solve continues until the time limit.
With parallel=off, the exact same MPS solves to optimality in 63.89 s using one additional pivot (82448 → 82449). The residual is identical; serial EKK simply handles it correctly.
Appears to match the signature reported in #1547 (dormant since Nov 2024) and the e-zaline comment in #1044 (the Windows-specific thread-join hang was fixed in #1942 / 1.8.0, but the PAMI cycling was never addressed).
Reproduction
Command-line (external binary):
# Stalls — runs until time_limit
./highs --model_file flextool.mps --options_file options_parallel_on.txt
# Optimal in ~64 s
./highs --model_file flextool.mps --options_file options_parallel_off.txt
options_parallel_on.txt:
parallel=on
threads=4
time_limit=300
options_parallel_off.txt:
parallel=off
threads=1
time_limit=300
Python (highspy==1.14.0):
import highspy

h = highspy.Highs()
h.setOptionValue("output_flag", True)
h.setOptionValue("log_to_console", True)
h.setOptionValue("parallel", "off")   # flip to "on" to reproduce the stall
h.setOptionValue("threads", 1)        # change to 4 with parallel=on
h.setOptionValue("time_limit", 120.0)
h.readModel("flextool.mps")
h.run()
print(h.getModelStatus(), h.getObjectiveValue())
(Note: within a single Python process the global scheduler is initialized only once, so on/off comparisons require separate processes; otherwise the warning "Option 'threads' is set to 1 but global scheduler has already been initialized" appears.)
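One way to automate the per-process comparison is to launch each configuration in its own interpreter. This is an illustrative helper, not part of highspy; `build_cmd`, `solve_isolated`, and `SOLVE_SNIPPET` are names invented here:

```python
import subprocess
import sys
import textwrap

# Script run inside each fresh interpreter, so HiGHS's global scheduler
# is initialized exactly once per process/configuration.
SOLVE_SNIPPET = textwrap.dedent("""\
    import sys
    import highspy
    h = highspy.Highs()
    h.setOptionValue("parallel", sys.argv[1])
    h.setOptionValue("threads", int(sys.argv[2]))
    h.setOptionValue("time_limit", 300.0)
    h.readModel(sys.argv[3])
    h.run()
    print(h.getModelStatus(), h.getObjectiveValue())
""")

def build_cmd(parallel: str, threads: int, mps_path: str) -> list:
    """Command line for one isolated solve (own process, own scheduler)."""
    return [sys.executable, "-c", SOLVE_SNIPPET, parallel, str(threads), mps_path]

def solve_isolated(parallel: str, threads: int, mps_path: str) -> str:
    """Run the solve in a subprocess and return its stdout."""
    result = subprocess.run(build_cmd(parallel, threads, mps_path),
                            capture_output=True, text=True)
    return result.stdout
```

With this pattern, `solve_isolated("off", 1, "flextool.mps")` and `solve_isolated("on", 4, "flextool.mps")` never share a scheduler, so the scheduler warning cannot occur.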
Observed behavior (parallel=on)
WARNING: Number of threads available = 4 < 8 = Simplex concurrency to be used:
Parallel performance may be less than anticipated
Using EKK parallel dual simplex solver - PAMI with concurrency of 8
Iteration Objective Infeasibilities num(sum)
0 0.0000000000e+00 Pr: 13888(38357.9) 1s
31902 2.7451378939e+00 Pr: 17768(582859); Du: 0(3.00727e-06) 6s
46928 3.2000446137e+00 Pr: 18140(88421.5); Du: 0(2.12556e-06) 11s
55511 3.4904474115e+00 Pr: 15695(1.03267e+06); Du: 0(2.06150e-06) 16s
78341 3.3582641992e+00 Pr: 0(0); Du: 2116(1.46961) 33s
81843 3.3391223839e+00 Pr: 0(0); Du: 544(0.426523) 38s
83295 3.3345044289e+00 Pr: 0(0); Du: 0(8.79675e-12) 41s ← reaches optimum
Using EKK parallel dual simplex solver - PAMI with concurrency of 8
WARNING: Increasing Markowitz threshold to 0.5
83295 3.3345044289e+00 Pr: 1(1.07048e-05) 46s ← quality rescan finds residual
83295 3.3345044289e+00 Pr: 1(1.07048e-05) 51s ← frozen; no progress
83295 3.3345044289e+00 Pr: 1(1.07048e-05) 56s
... residual never changes, iteration never advances, time limit reached at 300 s ...
Expected behavior (parallel=off, same MPS)
Using EKK dual simplex solver - serial
Iteration Objective Infeasibilities num(sum)
0 0.0000000000e+00 Pr: 13888(38357.9) 0s
82448 3.3345045089e+00 Pr: 0(0) 61s
Using EKK dual simplex solver - serial
82448 3.3345045089e+00 Pr: 1(1.07e-05) 61s ← same residual
82449 3.3345045097e+00 Pr: 0(0) 61s ← one pivot clears it
Performed postsolve
Model status : Optimal
Simplex iterations: 82449
HiGHS run time : 63.89
Key observations
Not a version regression — 1.11, 1.12 and 1.14 behave identically on the same MPS under parallel=on, and all three solve it trivially under parallel=off. The symptom lies in the parallel dual simplex code path, not in a recent change.
PAMI's hardcoded concurrency of 8 matters. The WARNING: threads available = 4 < 8 = Simplex concurrency to be used message consistently appears in stalling runs. simplex_max_concurrency=4 was suggested in #1547 (Inexplicable performance variation) but did not help (per NPC's report there). It remains to be verified whether forcing the simplex concurrency down to the available thread count would change the behavior on this model.
Residual reappears at the same magnitude every rescan. The pivot to clear it succeeds under serial; it is not a fundamentally infeasible problem.
The root cause appears to be concurrent simplex trajectories disagreeing on how to clear the rescan residual, combined with the Markowitz threshold bump preventing any of them from making the one pivot that serial EKK makes.
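A minimal options file for that verification experiment might look like the following (simplex_max_concurrency is the option name already mentioned above; whether pinning it to the 4 available threads changes the stall on this model is exactly what needs checking):

```
parallel=on
threads=4
simplex_max_concurrency=4
time_limit=300
```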
What we already tried
parallel=off — works (63.89 s vs indefinite / time-limit-reached).
threads=1 — works (equivalent to parallel=off in practice; HiGHS falls back to serial EKK).
simplex_max_concurrency=4 — did not help in the attempt reported in #1547 (see Key observations).
solver=ipm — avoids the stall but is 5–10× slower on this class of LP, and has its own crossover-to-simplex stall pattern.
Upgrading HiGHS — no fix in any tagged release through 1.14.
Workaround adopted
Set parallel=off as the default in the project's solver options file. HiGHS maintainer @jajhall has recommended threads=1 as the canonical workaround in adjacent issues (#780, #1044), consistent with this finding. Until parallel dual simplex is either fixed or removed, we will not raise threads above 1 for this class of model.
Request
Confirm whether this is the same root cause as #1547, or a separate pathology that happens to share the "Markowitz=0.5, iter frozen" signature.
If #1547 (Inexplicable performance variation) is known and still under consideration, clarify whether a fix is planned for a tagged release or whether PAMI is expected to remain parallel=off de facto in the near term, so that downstream projects can make informed defaults.
The stall reproduces with the attached pami_stall.mps.gz. The attachment also includes the two options files above and the full logs (one run each of parallel=on and parallel=off with HiGHS 1.14).
Attachments
pami_stall.mps.gz — the reproducing model
options_parallel_on.txt, options_parallel_off.txt — the two options files above
log_parallel_on_v1.14.txt — time limit reached at 300 s
log_parallel_off_v1.14.txt — optimal in 63.89 s
Related issues
Inexplicable performance variation #1547 — same stall signature; dormant since Nov 2024
LP failing with parallel=on (EKK parallel dual simplex) #780 — parallel=on LP failure, closed
Julia binaries of HiGHS hang after solving a simple LP on Windows #1044 — the Windows thread-join hang was fixed in #1942 / 1.8.0; the PAMI cycling was not
Avoid highs::parallel::spawn in HekkDual.cpp #2566 — notes highs::parallel::spawn fragility in HekkDual.cpp
Possible fixes
Some approaches you could try (least invasive first):
Serial bail-out on rescan failure
When the post-optimality rescan finds a residual and the Markowitz bump then produces N log intervals (e.g. N = 3) with no change in either the simplex iteration count or the residual value, declare PAMI stalled and hand the current basis to serial EKK for cleanup. Serial handles this case trivially; correctness and the parallel speedup of the productive phase are preserved, and only the final cleanup is serialized.
Flip the default
Make parallel=off the default, so that PAMI becomes opt-in via an explicit parallel=on. This does not fix the bug, but it aligns with @jajhall's own recommendation in LP failing with parallel=on (EKK parallel dual simplex) #780 and Julia binaries of HiGHS hang after solving a simple LP on Windows #1044, prevents users from hitting the stall by accident, and gives the maintainers time to address PAMI carefully. FlexTool already defaults to parallel=off (but then gains no parallelism in, e.g., large MIP problems).
Root-cause fix (largest)
If the residual can be reliably reproduced in the PAMI basis but not in the serial basis, there may be a basis-factorization race during HekkDual.cpp's parallel trajectory merge. Issue Avoid highs::parallel::spawn in HekkDual.cpp #2566 hints at this (highs::parallel::spawn is noted as fragile in the same file). If debugger data from log_dev_level=3 confirms a concurrent modification of the basis factors during the rescan phase, proper synchronization around the final cleanup merge would fix it.
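The serial bail-out trigger could be prototyped with a detector like the following sketch (illustrative Python, not HiGHS internals; make_stall_detector, n_limit, and tol are names invented here):

```python
# Sketch of the proposed stall detector: after the post-optimality rescan,
# feed each progress-log sample (iteration count, residual sum) into the
# detector; once n_limit consecutive samples show no change, hand the
# current basis to serial EKK instead of looping under PAMI.
def make_stall_detector(n_limit: int = 3, tol: float = 0.0):
    history = []  # list of (iteration, residual) samples

    def observe(iteration: int, residual: float) -> bool:
        """Return True when PAMI should bail out to serial cleanup."""
        history.append((iteration, residual))
        if len(history) < n_limit:
            return False
        ref_iter, ref_resid = history[-n_limit]
        return all(it == ref_iter and abs(r - ref_resid) <= tol
                   for it, r in history[-n_limit:])

    return observe
```

Fed the frozen log above — (83295, 1.07048e-05) three times in a row — the detector fires on the third sample; fed a run whose iteration count is still advancing, it never fires.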