Performance Accelerators
This chapter documents the performance optimization techniques built into Cobre’s SDDP solver. Each accelerator addresses a specific cost driver in the training loop and is active by default unless noted otherwise. Understanding them helps users interpret timing statistics, configure cut management strategies, and diagnose performance regressions.
LP Setup Optimizations
Each SDDP iteration requires solving hundreds to thousands of LP subproblems. Minimizing per-solve overhead is critical.
Model Persistence
The structural LP for each stage (the constraint matrix, variable bounds,
and objective coefficients) is assembled once at initialization into a
StageTemplate. During the training loop, the solver loads the template
once per (worker, stage) pair and then only patches the scenario-dependent
row bounds for each forward-pass scenario. This avoids rebuilding the
entire LP from scratch at every scenario evaluation.
The simulation pipeline uses the same pattern: a stage-major loop loads
the LP once per (worker, stage) and then iterates over scenarios, patching
bounds only. This reduces LP assembly overhead from O(scenarios x stages)
to O(workers x stages).
Incremental Cut Injection
Benders cuts are appended to the persistent lower-bound LP via add_rows
without rebuilding the structural model. A CutRowMap provides O(1)
slot-to-row lookup so the incremental append skips cuts that are already
present.
The LB LP is strictly append-only: rows generated during training are appended and never removed, which keeps the lower bound monotonically non-decreasing across iterations. Row selection in the shared row pool still affects the forward and backward passes — pool-deactivated rows remain as LP rows in the LB solver but are not re-evaluated, so they contribute only their binding value at the trial point.
PatchBuffer Pre-Allocation
The PatchBuffer holds three parallel arrays (indices, lower, upper)
consumed by the solver’s set_row_bounds call. It is sized once at
construction for the maximum number of patches across all stages:
| Category | Range | Content |
|---|---|---|
| 1 | [0, N) | Storage-fixing: equality constraint at incoming storage |
| 2 | [N, N*(1+L)) | Lag-fixing: equality constraint at AR lagged inflows |
| 3 | [N*(1+L), N*(2+L)) | Noise-fixing: equality constraint at scenario noise |
| 4 | [N*(2+L), N*(2+L) + M*B) | Load balance: stochastic load demand per bus per block |
| 5 | [N*(2+L) + M*B, ...) | z-inflow RHS: inflow variable bounds |
Where N = hydro plants, L = max PAR order, M = stochastic load buses, B = max blocks per stage. The buffer is reused across all iterations and scenarios with zero hot-path allocation.
Solver Safeguards
When HiGHS returns a non-terminal error (SOLVE_ERROR or UNKNOWN),
the solver automatically escalates through a 12-level retry sequence
organized in two phases, with per-level and overall wall-clock budgets.
The caller never sees intermediate failures — only the final
Ok(solution) or Err(SolverError).
Phase 1 (levels 0–4): Cumulative Sequence
Each level stacks on top of the previous:
| Level | Action |
|---|---|
| 0 | Clear cached basis and factorization |
| 1 | Enable presolve |
| 2 | Switch to dual simplex |
| 3 | Relax feasibility tolerances (1e-6) |
| 4 | Switch to interior point method (IPM) |
Phase 2 (levels 5–11): Extended Strategies
Each level starts from restored defaults with presolve and iteration limits, then applies level-specific options:
| Level | Action |
|---|---|
| 5 | Scale strategy 3 |
| 6 | Primal simplex + scale strategy 4 |
| 7 | Scale strategy 3 + relaxed tolerances |
| 8 | Objective scale (-10) |
| 9 | Primal simplex + objective scale (-10) + bound scale (-5) |
| 10 | Objective scale (-13) + bound scale (-8) + relaxed tolerances |
| 11 | IPM + objective/bound scaling + relaxed tolerances |
Budgets: 15 seconds per level in Phase 1, 30 seconds per level in
Phase 2, 120 seconds overall. Iteration limits are set to
max(100_000, 50 x num_cols) for simplex and 10,000 for IPM.
Default solver settings are restored unconditionally after the retry loop,
regardless of outcome. The per-level retry histogram is recorded in
SolverStatistics.retry_level_histogram and written to
training/solver/retry_histogram.parquet for post-run analysis.
LP Scaling
Before each stage’s LP template is built, a prescaler normalizes the constraint matrix coefficients toward 1.0, improving numerical conditioning and reducing the need for HiGHS’s internal scaling.
Column Scaling
For each column j, the scale factor is
1 / sqrt(max|A_ij| * min|A_ij|) over non-zero entries. The matrix
values, objective coefficients, and column bounds are scaled in-place.
After solving, primal values are unscaled: x_original[j] = col_scale[j] * x_scaled[j].
Row Scaling
Applied after column scaling with the same geometric-mean formula per row.
After solving, duals are unscaled: dual_original[i] = row_scale[i] * dual_scaled[i].
Cost Scale Factor
A constant COST_SCALE_FACTOR = 1000 is applied to all objective
coefficients to reduce the magnitude of objective coefficients, improving
simplex numerical stability.
Because the prescaler normalizes matrix entries toward 1.0, HiGHS’s
internal scaling (simplex_scale_strategy) is disabled (set to 0) in
every solver profile — including the retry-escalation levels — to avoid
double-scaling the already-conditioned matrix.
The scaling diagnostics are written to training/scaling_report.json
after template construction, documenting the coefficient range before
and after scaling for each stage.
Cut Management Pipeline
As training progresses, the row pool grows and LP solve times increase. Cobre provides a two-stage row management pipeline to control this growth while preserving convergence guarantees.
The pipeline runs after each iteration’s backward pass and cut synchronization:
Stage 1: Strategy-based selection (check_frequency gated)
|
v
Stage 2: Budget enforcement (every iteration)
Stage 1: Strategy-Based Selection
Four strategies are available, configured via
cut_selection in config.json:
| Strategy | Selection Mechanism | Aggressiveness |
|---|---|---|
level1 | Deactivates cuts below tie_tolerance of the per-state max at every visited state | Least |
lml1 | Deactivates cuts that are not the oldest eligible within tie_tolerance at any visited state | Medium |
domination | Deactivates cuts below domination_tolerance of the per-state max at every visited state (all populated cuts) | Most |
dynamic | Lazy incremental scheme: adds at most max_added_per_round cuts per inner re-solve round that violate the current LP solution by more than violation_tolerance; never deactivates cuts from the pool | Different |
level1, lml1, and domination respect check_frequency: selection runs
only at iterations that are multiples of check_frequency. Stage 0 is
always exempt (its rows drive the lower bound and are never backward-pass
successors). Selection runs in parallel across stages via rayon.
level1, lml1, and domination share a single value-evaluation kernel
that performs O(|populated cuts| x |visited states|) work per stage per
check. Every populated cut is evaluated at every visited forward-pass
state (including cuts currently flagged inactive, which means a previously
deactivated cut can be reactivated when it later achieves the maximum at
some state). The visited-states archive is collected during training for
these three variants. The tie_tolerance parameter (default 1e-10) on
level1 and lml1 controls how closely a cut must approach the per-state
maximum to be retained; domination uses the domination_tolerance field
for the same purpose.
dynamic (Dynamic Cut Selection, DCS) operates differently: it is a
per-solve lazy selection loop that adds cuts on demand rather than
deactivating from a full pool scan. It never invokes the value-evaluation
kernel and does not respect check_frequency. The initial active set is
seeded from the seed_window most recent iterations. See
cut_selection for the full parameter
reference.
Stage 2: Budget Enforcement
A hard-cap safety net on LP size, enabled via max_active_per_stage.
When the number of active rows exceeds the budget after Stage 1, the
pool evicts rows sorted by staleness (last_active_iter ascending,
then active_count ascending). Rows from the current iteration are
always protected.
Unlike Stage 1, budget enforcement runs every iteration (not gated
by check_frequency).
Configuration:
{
"training": {
"cut_selection": {
"max_active_per_stage": 500,
"selection": {
"method": "level1",
"tie_tolerance": 1e-10,
"check_frequency": 5
}
}
}
}
Why it matters: High-parallelism configurations (many forward passes, few iterations) accumulate more active rows than low-parallelism configurations (fewer forward passes, more iterations), making each backward LP solve proportionally more expensive. Bounding LP size makes high-parallelism configurations viable without unbounded solve-time growth.
Observability
The row management pipeline writes per-stage statistics to
training/cut_selection/iterations.parquet with 10 columns:
| Column | Description |
|---|---|
iteration | Training iteration |
stage | Stage index |
cuts_populated | Total row slots populated |
cuts_active_before | Active rows before selection |
cuts_deactivated | Rows deactivated by Stage 1 |
cuts_reactivated | Rows reactivated by Stage 1 |
cuts_active_after | Active rows after Stage 1 |
selection_time_ms | Wall-clock time for the selection |
budget_evicted | Rows evicted by Stage 2 (null if disabled) |
active_after_budget | Active rows after Stage 2 (null if disabled) |
Basis Warm-Start
Reusing the LP simplex basis from the previous solve reduces the number of simplex pivots needed for subsequent solves.
BasisStore
The BasisStore holds one Basis per (scenario, stage) pair in a flat
array indexed as bases[scenario * num_stages + stage]. Before the parallel
forward pass, the store is split into disjoint per-worker sub-views
(split_workers_mut) so no synchronization is needed during writes.
The Basis struct stores solver-native i32 status codes directly,
enabling zero-copy warm-starts via memcpy — no per-element enum
translation is needed.
Simulation Basis Broadcast
When running with MPI, rank 0’s scenario-0 basis is broadcast to all ranks before the simulation phase. This ensures all ranks warm-start simulation from the same LP vertex, regardless of rank count.
Basis Reconstruction
Each stored warm-start basis is wrapped in a CapturedBasis { basis, base_row_count, cut_row_slots, state_at_capture } struct that records
the LP row count and the ordered list of row-pool slot indices at
capture time, alongside the state vector at which the basis was
captured. The reconstruct_basis function in
cobre-sddp::basis_reconstruct is the sole entry point for applying a
stored basis across row-set churn on the forward pass, backward pass,
and simulation pipeline.
When a stored basis is applied to an LP whose appended rows have changed,
reconstruct_basis walks the current LP’s appended rows, looks each slot
up in an O(1) scratch map built from cut_row_slots, and classifies
each row into one of two paths:
- Preserved (slot present in the stored basis): the original status is copied verbatim.
- New (slot not present — a row added since capture): the row is
unconditionally assigned
NONBASIC_LOWER(tight guess).
Each NONBASIC_LOWER classification on a new row requires a
compensating demotion on a preserved row to keep HiGHS’s
column-basic + row-basic invariant. The stalest preserved-LOWER
candidate is promoted, ranked lexicographically by insertion order.
When new-LOWER classifications outnumber preserved-LOWER
candidates, a tail fallback flips the most recent new-LOWER rows
back to BASIC until the invariant holds.
Reconstruction is always active when a stored basis exists — there is
no configuration flag. The basis_activity_window config knob that
earlier versions accepted has been removed; a config that still sets it
now fails to load with an unknown-field error.
The in-memory SolverStatistics::basis_reconstructions counter tracks how
often reconstruct_basis was invoked with a non-empty stored basis.
Backward-Pass Basis Cache
During training, rank 0’s ω=0 backward-pass worker captures a fresh
basis for every stage into a per-iteration backward cache. At end of
iteration the cache is broadcast to all ranks, and on the next
iteration’s backward pass every rank’s ω=0 solve warm-starts from the
cached basis instead of falling back to the forward-pass BasisStore.
The first iteration has no backward cache yet, so it uses the forward
cache exclusively.
The backward cache matters because rows added earlier in the current
iteration’s backward walk are new relative to the previous
iteration’s stored basis — so the classifier fires frequently on
backward solves, while the forward pass sees mostly preserved slots
and the classifier rarely runs. A warm-start at ω=0 also cascades
through the remaining openings (ω=1..n_openings-1) via HiGHS’s
retained factorization, amplifying the per-solve impact.
Parallel Execution
Backward Pass Work-Stealing
The backward pass parallelizes the inner trial-point loop using atomic
counter work-stealing: each worker claims the next available trial-point
index via AtomicUsize::fetch_add(1, Relaxed). This keeps all threads
busy even when trial points solve in variable time.
After the parallel region, staged rows are sorted by trial_point_idx
and inserted into the FCF in deterministic order, guaranteeing bit-for-bit
identical results regardless of thread count or completion order.
Per-Phase Solver Profiles
Each algorithmic phase — forward sweep, backward sweep, and simulation — can
be configured with a distinct HighsProfile that sets the LP solver’s
feasibility tolerances and per-attempt iteration caps. Tuning BACKWARD_PROFILE
to tighter tolerances or stricter iteration caps can reduce backward-pass
solve time variance, which in turn improves load balance across worker threads
and shortens wall-clock training time. FORWARD_PROFILE and
SIMULATION_PROFILE ship equal to HighsProfile::default(), while
BACKWARD_PROFILE already overrides simplex_price_strategy to 2
(RowHyperSparse) to exploit sparsity on the backward LPs; all other backward
fields match the default.
Forward Pass and Simulation
Scenarios are statically partitioned across solver workspace instances (not rayon’s default work-stealing), making the scenario-to-worker assignment deterministic. Within each scenario, the LP is loaded once per stage and only row bounds are patched per scenario.
Lower Bound Evaluation
The lower bound evaluation (solving a stage-0 LP for every opening in the tree) runs as a single-threaded serial loop on rank 0. Each opening patches correctness-critical per-opening state (e.g. NCS column bounds) on a shared solver, so the openings cannot be split across workers without fragmenting those sequential steps; the step is therefore not parallelized.
Communication-Free Seed Derivation
Forward pass noise is generated without inter-rank communication. Each
rank independently derives its noise seed from
(base_seed, iteration, scenario, stage) using deterministic SipHash-1-3
seed derivation. The opening tree is pre-generated once before training
and shared read-only.
Memory Efficiency
Pre-Allocation Discipline
The forward, backward, and simulation per-solve hot paths make no heap allocations inside the iteration loop; all workspace buffers are allocated once before the loop. (The periodic cut-selection pass is the one documented exception — its rayon fold/reduce kernel allocates per-leaf scratch.) The pre-allocated buffers are:
| Buffer | Size |
|---|---|
TrajectoryRecord flat vec | forward_passes x num_stages records |
PatchBuffer | N*(2+L) + M*max_blocks entries |
ExchangeBuffers (state allgatherv) | local_count x num_ranks x n_state floats |
CutSyncBuffers (row-sync allgatherv) | max_cuts_per_rank x num_ranks x cut_wire_size bytes |
ScratchBuffers per worker | noise, inflow, lag matrix, PAR, eta, load, z-inflow buffers |
Basis per worker | pre-allocated with template_rows + max_cut_rows entries |
CutPool Flat Coefficient Storage
Row coefficients are stored as a single contiguous Vec<f64> of size
capacity x state_dimension rather than a Vec<Vec<f64>>. This provides
cache-friendly sequential access during batch iteration (row evaluation,
dominance checks) and eliminates per-row heap allocation.
Lazy FCF Growth
The CutPool grows its coefficient storage on demand using a doubling
strategy (minimum 16 slots) rather than pre-allocating to the theoretical
maximum capacity. This prevents memory exhaustion on pathological parameter
combinations (e.g., 1000 iterations x 1000 forward passes x 50 states x
120 stages would require 48 GB with eager pre-allocation).
O(1) Active Row Count
CutPool maintains a cached_active_count that is updated incrementally
on each activation/deactivation, making active_count() O(1) instead of
requiring a scan of the entire pool.
Compile-Time Solver Dispatch
SolverInterface is resolved as a generic type parameter at compile time,
not as Box<dyn SolverInterface>. All solver calls monomorphize to direct
function calls with no virtual dispatch overhead — critical when tens of
millions of LP solves occur per training run.
See Also
- Configuration — row-selection and row management configuration
- Output Format — timing, solver statistics, and row-selection output schemas
- cobre-solver — solver interface and retry escalation details
- cobre-sddp — training loop architecture and data structures