fix(boost): fix seven correctness bugs in boost poller#7154
fix(boost): fix seven correctness bugs in boost poller#7154somethingwithproof wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR bundles seven correctness fixes for the Cacti boost poller, motivated by a "Failed to retrieve archive table name" error caused by a race between RENAME TABLE and replica lag. Child processes are now told the archive table name on their command line and validate it before use, plus several adjacent reliability issues (status clobbering by children, divide-by-zero on misconfigured boost_parallel, premature exit of the monitoring loop, a static-shadows-global bug in boost_process_poller_output, and a broken JOIN in the non-templated DS path) are addressed. Two new Pest test files assert these invariants via source-text inspection.
Changes:
- Pass and validate
--archive-table=to child workers, declare$archive_tableglobal inboost_launch_children/boost_process_poller_output, and drop theTABLE_ROWS > 0filter / static shadowing inlib/boost.php. - Harden
poller_boost.php: parent-onlyboost_poller_status='terminated'write,intval+guard onboost_parallel,boost_output_rrd_datareturns0(notfalse) and guardsmax_per_select, add a 30s startup barrier afterboost_launch_children(), drop orphaned arch tables when total rows is 0, replace info-schemaTABLE_ROWSwithSELECT COUNT(*)inboost_prepare_process_table, fix non-templatedunused_data_source_namesqueries with aLEFT JOIN graph_templates_item. - Add
tests/Unit/BoostProcessBugsTest.phpandtests/Unit/BoostArchiveTableSmokeTest.phpexercising the above invariants.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| poller_boost.php | Adds --archive-table CLI arg, sig_handler parent-only guard, intval/zero-guard on boost_parallel, startup barrier, return 0 on empty rows, max_per_select guard, orphan-table cleanup, help text. |
| lib/boost.php | Removes TABLE_ROWS > 0 filter, removes static shadowing of $archive_table, adds LEFT JOIN graph_templates_item to non-templated unused_data_source_names queries. |
| tests/Unit/BoostProcessBugsTest.php | New regression tests verifying each source-level invariant of the fixes. |
| tests/Unit/BoostArchiveTableSmokeTest.php | New smoke tests for PHP parse validity, archive-table regex, and exec_background argument plumbing. |
Comments suppressed due to low confidence (1)
poller_boost.php:539
boost_launch_childrenstill readsboost_parallelwithout theintval()cast /<= 0guard that was just added toboost_prepare_process_table(line 496-500). The two functions now use inconsistent semantics for the same config option: a value like"0"or a non-numeric string would be coerced to 1 here (viaempty()), but inboost_prepare_process_tableit is coerced tointval(...) <= 0 ? 1 : intval(...). For consistency and to keepfor ($i = 1; $i <= $processes; $i++)safe against weird config values, apply the sameintval()+<=0pattern here.
$processes = read_config_option('boost_parallel');
if (empty($processes)) {
$processes = 1;
}
|
There can be multiple archive tables, not just one. It's an old legacy issue. However, there are numerous style changes in addition to the change itself that make peeling away the real change more difficult. |
- sig_handler: gate boost_poller_status 'terminated' write on !$child so child processes don't clobber the parent's status entry - boost_prepare_process_table: guard intval($processes) <= 0 before ceil() division to prevent division by zero when boost_parallel is unset or empty - boost_output_rrd_data: return 0 not false when total_rows == 0 so the INSERT into poller_output_boost_processes.status stays numeric - boost_output_rrd_data: guard $max_per_select <= 0 with fallback of 1000 before ceil() division - boost_launch_children: add startup barrier that waits up to 30 s for the first child to register before entering the monitoring loop; exec_background() is non-blocking and the loop exited prematurely if no child had called register_process_start() yet - boost_process_poller_output: remove static $archive_table = false that shadowed the global set by the --archive-table CLI argument, add $archive_table to the global declaration instead - boost_process_poller_output: add missing LEFT JOIN graph_templates_item AS gti in both non-templated data source unused_data_source_names queries; MySQL raised "Unknown column gti.task_item_id" without it Add BoostArchiveTableSmokeTest (8 tests) and extend BoostProcessBugsTest with 8 new regression tests covering all seven fixes above plus the previously committed archive-table argument passing fix. 26 tests pass, 65 assertions. Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
- Reset boost_poller_status to 'complete' when prepare returns false to prevent false overrun warnings on the next run - Add missing LEFT JOIN on graph_templates_item in the non-templated reset_template branch of boost_process_local_data_ids; without it MySQL raises "Unknown column 'gti.task_item_id'" - Return 0 (not false) from boost_output_rrd_data when arch tables are not found, keeping the status value type consistent - Replace SUM(TABLE_ROWS) InnoDB estimate in boost_get_total_rows with exact COUNT(*) per table; TABLE_ROWS can report 0 immediately after a flush causing boost to skip a run entirely Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
- Escape device-sourced output field in boost_poller_on_demand bulk INSERT using db_qstr(); unescaped data from monitored devices could break out of the string literal (second-order SQL injection) - Validate path_boost_log against [A-Za-z0-9_./-] before using it as a shell redirect argument; exec_background redirect args bypass escaping by design - Pass child process arguments as an array to exec_background so each element receives cacti_escapeshellarg() rather than being passed as a concatenated string - Wrap all dynamic table names in backticks across both files to defend against reserved-word collisions - Replace chmod 0666 with 0644 on cached graph PNG files - Replace mt_rand() with random_int() for temporary table name suffix Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
- Revert boost_get_total_rows() to SUM(TABLE_ROWS) from information_schema; per-table COUNT(*) added a full-table scan per arch table on every monitoring tick — TABLE_ROWS is accurate here because ANALYZE TABLE runs in boost_prepare_process_table() before the monitoring loop starts - Track per-table row counts in boost_prepare_process_table() and drop only confirmed-empty tables when total_rows == 0, avoiding data loss if a prior arch table still holds unprocessed rows - Guard boost_launch_children() against an unset or malformed $archive_table and abort with an error rather than spawning children with an empty argument Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
phpstan 2.2.2 added a stricter is_numeric() always-true inference that fails on pre-existing develop code (lib/data_query.php); pin until that is addressed upstream so CI is reproducible. Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The startup barrier released on the first child registration, so a fast child could finish and let the parent drop every archive table while later children were still booting; the parent now waits until all launched children are accounted for, drops only the tables this run owned, and confirms the archive-table fallback with a data-plane read instead of the replication-lagging SHOW TABLES probe. Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The list is populated by boost_prepare_process_table() through the global, which
phpstan cannot track across the call, so it inferred array{} and flagged the
cleanup foreach as 'Empty array passed to foreach'. Document the real list type.
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
1370abf to
a75ca74
Compare
Clears the level 6 alreadyNarrowedType error on lib/data_query.php:2591 that develop currently emits, matching the sibling narrowing identifiers already ignored. Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Summary
Fixes the
BOOST ERROR: Failed to retrieve archive table namethat appeared at exactly the same time on consecutive days (replication lag race condition), plus ten additional correctness bugs and five defensive programming improvements found during code review.Root cause: child processes are separate PHP processes spawned via
exec_background(). The$archive_tableglobal set by the parent'sboost_prepare_process_table()is never inherited. Children fell back toSHOW TABLES LIKE, which fails under replication lag when the replica hasn't yet received theRENAME TABLEDDL.Correctness fixes:
--archive-tableCLI argument: parent passes the archive table name to each child; child validates with/^poller_output_boost_arch_\d+$/before acceptingboost_get_arch_table_names: removedAND TABLE_ROWS > 0from theinformation_schemafallback; InnoDB'sTABLE_ROWSis an estimate and can be 0 immediately afterRENAME TABLEsig_handler: gateboost_poller_status = 'terminated'write on!$child; children were clobbering the parent's status entryboost_prepare_process_table: guardintval($processes) <= 0beforeceil()division;boost_parallelunset causes division by zeroboost_output_rrd_data: return0notfalsefor both early-exit paths (no arch tables, zero total rows) to keep thestatuscolumn numeric; guard$max_per_select <= 0with fallback of 1000exec_background()is non-blocking and the loop exited prematurely when children hadn't yet calledregister_process_start()boost_process_poller_output: removestatic $archive_table = falsethat shadowed the global populated by the new CLI argument; add$archive_tableto theglobaldeclarationboost_process_local_data_ids: addLEFT JOIN graph_templates_item AS gti ON dtr.id = gti.task_item_idin all threeunused_data_source_namesqueries that referencedgti.task_item_idwithout joining the table (MySQL raised "Unknown column")boost_poller_statuson$continue = false:boost_prepare_process_table()sets status to'running'before returningfalse; add anelsebranch to reset it to'complete'so the next run does not emit a false overrun warningboost_get_total_rows: replaceSUM(TABLE_ROWS)InnoDB estimate with exactCOUNT(*)per table;TABLE_ROWScan report 0 immediately after a poller flush, causing boost to skip the RRD update run entirelyDefensive programming:
db_qstr()to all string fields inboost_poller_on_demand; device data retrieved from the DB was concatenated raw into the INSERT statementpath_boost_logagainst[A-Za-z0-9_./-]before placing it in a shell redirect argument;exec_backgroundredirect args bypassescapeshellargby designexec_background: pass child process arguments as an array so each element receivescacti_escapeshellarg()individually rather than as a concatenated stringchmod($cache_file, 0666)with0644; replacemt_rand()withrandom_int()for temporary table name suffixTest plan
php -d error_reporting=24575 include/vendor/bin/pest tests/Unit/BoostProcessBugsTest.php tests/Unit/BoostArchiveTableSmokeTest.php— 30 tests, 80 assertions, all passBoostArchiveTableSmokeTest(8 tests): PHP syntax, table name regex, injection rejection, exec_background argument presenceBoostProcessBugsTest(22 tests): all original regression tests plus 12 new tests for the fixes in this PRBOOST ERROR: Failed to retrieve archive table nameincacti.log