
Parallel dual simplex (PAMI) stalls indefinitely #2991

@jkiviluo

Description

Bug report written mostly by Opus 4.7 on my instructions (Opus also did most of the root-cause finding)

Title

Parallel dual simplex (PAMI) stalls indefinitely after post-optimality residual rescan; serial EKK resolves the same problem in one extra pivot

Environment

  • HiGHS: 1.11.0, 1.12.0, 1.14.0 — all three versions reproduce identically
  • Binding: reproduced via external binary (bin/highs_v1.11, bin/highs_v1.12) and highspy==1.14.0 (Python)
  • OS: Linux (6.14.0), x86_64, 4 physical cores available (symptom is worst with threads < 8)
  • Model: ~32 MB free-format MPS (5 MB gzipped) — 207,522 rows × 184,608 cols, 704,825 non-zeros. Pure LP (no integer variables in this reduction). From an energy-system optimization model (FlexTool, rivendell hydro-cascade scenario). Structure is typical of power-system dispatch: mixed demand-balance, min-up/min-down commitment logic, hourly timesteps.

Summary

With parallel=on (default when threads > 1), dual simplex reaches Pr=0, Du=0 on the first pass. A post-optimality quality rescan then reveals a small residual infeasibility (e.g. Pr: 1(1.07e-05)). HiGHS responds by bumping the Markowitz threshold to 0.5 and attempting to clear the residual via further pivots — but the simplex iteration count freezes at the same value and the residual never changes. The solve continues until the time limit.

With parallel=off, the exact same MPS solves to optimality in 63.89 s using one additional pivot (82448 → 82449). The residual is identical; serial EKK simply handles it correctly.

This appears to match the signature reported in #1547 (dormant since Nov 2024) and e-zaline's comment in #1044; the Windows-specific thread-join hang was fixed in #1942 (released in 1.8.0), but the PAMI cycling was never addressed.

Reproduction

Command-line (external binary):

# Stalls — runs until time_limit
./highs --model_file flextool.mps --options_file options_parallel_on.txt

# Optimal in ~64 s
./highs --model_file flextool.mps --options_file options_parallel_off.txt

options_parallel_on.txt:

parallel=on
threads=4
time_limit=300

options_parallel_off.txt:

parallel=off
threads=1
time_limit=300

Python (highspy==1.14.0):

import highspy
h = highspy.Highs()
h.setOptionValue("output_flag", True)
h.setOptionValue("log_to_console", True)
h.setOptionValue("parallel", "off")   # flip to "on" to reproduce the stall
h.setOptionValue("threads", 1)        # change to 4 with parallel=on
h.setOptionValue("time_limit", 120.0)
h.readModel("flextool.mps")
h.run()
print(h.getModelStatus(), h.getObjectiveValue())

(Note: within a single Python process the global scheduler is initialized only once, so on/off comparisons require separate processes; otherwise HiGHS warns "Option 'threads' is set to 1 but global scheduler has already been initialized".)

Observed behavior (parallel=on)

WARNING: Number of threads available = 4 < 8 = Simplex concurrency to be used:
         Parallel performance may be less than anticipated
Using EKK parallel dual simplex solver - PAMI with concurrency of 8
  Iteration        Objective     Infeasibilities num(sum)
          0     0.0000000000e+00 Pr: 13888(38357.9) 1s
      31902     2.7451378939e+00 Pr: 17768(582859); Du: 0(3.00727e-06) 6s
      46928     3.2000446137e+00 Pr: 18140(88421.5); Du: 0(2.12556e-06) 11s
      55511     3.4904474115e+00 Pr: 15695(1.03267e+06); Du: 0(2.06150e-06) 16s
      78341     3.3582641992e+00 Pr: 0(0); Du: 2116(1.46961) 33s
      81843     3.3391223839e+00 Pr: 0(0); Du: 544(0.426523) 38s
      83295     3.3345044289e+00 Pr: 0(0); Du: 0(8.79675e-12) 41s   ← reaches optimum
Using EKK parallel dual simplex solver - PAMI with concurrency of 8
WARNING:    Increasing Markowitz threshold to 0.5
      83295     3.3345044289e+00 Pr: 1(1.07048e-05)             46s   ← quality rescan finds residual
      83295     3.3345044289e+00 Pr: 1(1.07048e-05)             51s   ← frozen; no progress
      83295     3.3345044289e+00 Pr: 1(1.07048e-05)             56s
      ... residual never changes, iteration never advances, time limit reached at 300 s ...

Expected behavior (parallel=off, same MPS)

Using EKK dual simplex solver - serial
  Iteration        Objective     Infeasibilities num(sum)
          0     0.0000000000e+00 Pr: 13888(38357.9) 0s
      82448     3.3345045089e+00 Pr: 0(0) 61s
Using EKK dual simplex solver - serial
      82448     3.3345045089e+00 Pr: 1(1.07e-05) 61s           ← same residual
      82449     3.3345045097e+00 Pr: 0(0) 61s                  ← one pivot clears it
Performed postsolve
Model status : Optimal
Simplex iterations: 82449
HiGHS run time : 63.89

Key observations

  1. Not a version regression: 1.11, 1.12, and 1.14 behave identically on the same MPS under parallel=on, and all three solve it trivially under parallel=off. The symptom lies in the parallel dual simplex code path, not in a recent change.
  2. The hardcoded PAMI concurrency of 8 matters. The WARNING: threads available = 4 < 8 = Simplex concurrency to be used message appears consistently in stalling runs. simplex_max_concurrency=4 was suggested in Inexplicable performance variation #1547 but did not help (per NPC's report there). It remains to verify whether forcing the simplex concurrency down to the thread count changes the behavior.
  3. Residual reappears at the same magnitude every rescan. The pivot to clear it succeeds under serial; it is not a fundamentally infeasible problem.
  4. The root-cause appears to be concurrent simplex trajectories disagreeing on how to clear the rescan residual, combined with the Markowitz threshold bump preventing any of them from making the one pivot that serial EKK makes.
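For observation 2 above, a hedged probe can be written in highspy (simplex_max_concurrency is a documented HiGHS option; whether capping it at the real core count avoids the stall is exactly what needs verifying, not a known fix; function names are illustrative):

```python
# Hedged probe for observation 2. simplex_max_concurrency is a documented
# HiGHS option; the effect of the cap on the stall is an open question.
# Function names here are illustrative, not HiGHS API.

def clamp_concurrency(requested: int, available: int) -> int:
    # Never ask PAMI for more concurrency than there are real threads.
    return min(requested, available)

def run_with_capped_concurrency(mps_path: str, threads: int = 4):
    import highspy  # assumes highspy is installed
    h = highspy.Highs()
    h.setOptionValue("parallel", "on")
    h.setOptionValue("threads", threads)
    # Replace the hardcoded concurrency of 8 with the actual thread count.
    h.setOptionValue("simplex_max_concurrency", clamp_concurrency(8, threads))
    h.setOptionValue("time_limit", 300.0)
    h.readModel(mps_path)
    h.run()
    return h.getModelStatus()
```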

What we already tried

  • parallel=off works (63.89 s vs. indefinite / time-limit reached).
  • threads=1 works (equivalent to parallel=off in practice; HiGHS falls back to serial EKK).
  • simplex_max_concurrency=4: did not help when tried in Inexplicable performance variation #1547.
  • solver=ipm — avoids the stall but is 5–10× slower on this class of LP, and has its own crossover-to-simplex stall pattern.
  • Upgrading HiGHS — no fix in any tagged release through 1.14.

Workaround adopted

Set parallel=off as the default in the project's solver options file. HiGHS maintainer @jajhall has recommended threads=1 as the canonical workaround in adjacent issues (#780, #1044), consistent with this finding. Until parallel dual simplex is either fixed or removed, we will not raise threads above 1 for this class of model.

Request

  1. Confirm whether this is the same root cause as #1547, or a separate pathology that happens to share the "Markowitz=0.5, iter frozen" signature.
  2. If #1547 (Inexplicable performance variation) is known and still under consideration, clarify whether a fix is planned for a tagged release or whether PAMI is expected to remain parallel=off de facto for the near term, so downstream projects can set informed defaults.
  3. A reproducer is attached as pami_stall.mps.gz, together with the two options files above and the full logs (one run each of parallel=on and parallel=off under HiGHS 1.14).

Attachments

  • pami_stall.mps.gz — 5 MB gzipped MPS reproducer (LP only, no integrality; 207,522 rows × 184,608 cols)
  • options_parallel_on.txt, options_parallel_off.txt
  • log_parallel_on_v1.14.txt — time-limit reached at 300 s
  • log_parallel_off_v1.14.txt — optimal in 63.89 s

Possible fixes

Some approaches you could try (least invasive first):

  1. Serial bail-out on rescan failure
    When the post-optimality rescan finds a residual AND the Markowitz bump then yields N consecutive reports (e.g., N=3) with no change in either the simplex iteration count or the residual value, declare PAMI stalled and hand the current basis to serial EKK for cleanup. Serial handles this trivially. The user loses neither correctness nor the parallel speedup during the productive phase; only the final cleanup is serialized.
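A downstream approximation of this bail-out can be sketched with highspy (`is_stalled` and `solve_with_fallback` are hypothetical helper names; that re-running the same Highs object serially warm-starts from the incumbent basis is an assumption worth verifying):

```python
# Downstream approximation of proposal 1: run with parallel=on under a
# budget; if the solve hits the limit, flip to serial and re-run the same
# Highs object, which should warm-start from the incumbent basis (an
# assumption worth verifying). is_stalled mirrors the "frozen iteration
# count, frozen residual" signature from the log above.

def is_stalled(snapshots, window=3):
    """True when the last `window` (iteration, residual) pairs are identical."""
    if len(snapshots) < window:
        return False
    tail = snapshots[-window:]
    return all(s == tail[0] for s in tail)

def solve_with_fallback(mps_path, budget=300.0):
    import highspy  # assumes highspy is installed
    h = highspy.Highs()
    h.setOptionValue("parallel", "on")
    h.setOptionValue("time_limit", budget)
    h.readModel(mps_path)
    h.run()
    if h.getModelStatus() != highspy.HighsModelStatus.kOptimal:
        # PAMI hit the budget: retry serially, which on this model clears
        # the residual in one extra pivot.
        h.setOptionValue("parallel", "off")
        h.setOptionValue("threads", 1)
        h.run()
    return h.getModelStatus()
```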

  2. Flip the default
    Make parallel=off the default, so PAMI becomes opt-in via an explicit parallel=on. This does not fix the bug, but it matches @jajhall's own recommendation in LP failing with parallel=on (EKK parallel dual simplex) #780 and Julia binaries of HiGHS hang after solving a simple LP on Windows #1044, prevents users from hitting the stall by accident, and buys the maintainers time to address PAMI carefully. FlexTool already defaults to parallel=off, at the cost of forgoing parallelism in, e.g., large MIP solves.

  3. Root-cause-level (largest)
    If the residual can be reliably reproduced in the PAMI basis but not the serial basis, there may be a basis-factorization race during HEkkDual.cpp's parallel trajectory merge. Issue Avoid highs::parallel::spawn in HEkkDual.cpp #2566 hints at this (highs::parallel::spawn is noted as fragile in the same file). If debugger data from a log_dev_level=3 run confirms a concurrent modification of the basis factors during the rescan phase, proper synchronization around the final cleanup merge would fix it.
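To collect that evidence, the options file from the reproduction section can be extended; log_dev_level and log_file are standard HiGHS options, and the log-file name below is illustrative:

parallel=on
threads=4
time_limit=300
log_dev_level=3
log_file=log_parallel_on_dev3.txt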
