Why:
The de-flake campaign exposed lookup-table reload tests that could resume
after HUP while the reload worker was still installing the replacement
table. The wait helper saw no pending reload and injected messages against
stale or stub lookup state.
Impact:
Lookup reload waiters now observe the full reload lifecycle.
Before/After:
Before, only queued reload requests were pending; after, an active reload
also remains pending until the table swap completes.
Technical Overview:
Track the interval after the reloader consumes do_reload but before
lookupDoReload() returns. lookupPendingReloadCount() now treats that
interval as pending, so imdiag's AwaitLookupTableReload command cannot
return while a reload is still applying. Initialize and clear the new state
alongside the existing reloader flags to keep startup, activation, and
shutdown state consistent.
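The pending-count rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the rsyslog implementation: struct reloader_state, the reload_active field, and all function names are hypothetical; only the do_reload flag is named in the text above.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-table reloader state. */
struct reloader_state {
    pthread_mutex_t mut;
    bool do_reload;     /* a reload request is queued */
    bool reload_active; /* request consumed, table swap not finished (assumed name) */
};

static void reloader_init(struct reloader_state *r) {
    pthread_mutex_init(&r->mut, NULL);
    r->do_reload = false;
    r->reload_active = false;
}

/* Called by whoever handles HUP: queue a reload request. */
static void request_reload(struct reloader_state *r) {
    pthread_mutex_lock(&r->mut);
    r->do_reload = true;
    pthread_mutex_unlock(&r->mut);
}

/* Worker side: consume the request and mark the reload active in the same
 * critical section, so no observer sees "nothing pending" in between. */
static bool reload_begin(struct reloader_state *r) {
    bool claimed = false;
    pthread_mutex_lock(&r->mut);
    if (r->do_reload) {
        r->do_reload = false;
        r->reload_active = true;
        claimed = true;
    }
    pthread_mutex_unlock(&r->mut);
    return claimed;
}

/* Worker side: called only after the replacement table is installed. */
static void reload_end(struct reloader_state *r) {
    pthread_mutex_lock(&r->mut);
    r->reload_active = false;
    pthread_mutex_unlock(&r->mut);
}

/* Waiter side: a queued request and an in-progress reload both count. */
static int pending_reload_count(struct reloader_state *r) {
    int n;
    pthread_mutex_lock(&r->mut);
    n = (r->do_reload ? 1 : 0) + (r->reload_active ? 1 : 0);
    pthread_mutex_unlock(&r->mut);
    return n;
}
```

A waiter that polls pending_reload_count() until it returns 0 cannot resume between reload_begin() and reload_end(), which is the window the old check missed.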
Validation:
- ./autogen.sh --enable-debug --enable-testbench --enable-imdiag --enable-omstdout
- make -j$(nproc) check TESTS=""
- ./tests/array_lookup_table.sh
- ./tests/lookup_table_bad_configs.sh
- git diff --check
- Ubuntu 26.04 dev container focused lookup reload run, 9/9 pass:
array_lookup_table.sh array_lookup_table-vg.sh
array_lookup_table_misuse-vg.sh lookup_table_bad_configs.sh
lookup_table_bad_configs-vg.sh lookup_table_rscript_reload.sh
lookup_table_rscript_reload-vg.sh
lookup_table_rscript_reload_without_stub.sh
lookup_table_rscript_reload_without_stub-vg.sh
With the help of AI-Agents: OpenAI Codex
Why:
The MySQL action queue tests validate lossless ommysql delivery under
bounded queue pressure. Recent flake evidence showed the fixed 30000ms
enqueue timeout becoming the effective oracle when CI load or the local
MySQL server delayed draining.
Impact:
The tests allow a slower stressed MySQL service to drain queued bursts,
but still fail if the action queue cannot make progress within 80s.
Before/After:
Before, the tests could drop one message after a 30000ms enqueue wait.
After, they use an 80000ms CI tolerance budget and still pass only when
the final MySQL sequence is complete.
Technical Overview:
Keep the queue size and worker-thread settings that exercise bounded
multi-worker action queue behavior. Raise queue.timeoutEnqueue to 80000ms
rather than removing the timeout entirely, so persistent database stalls
or test plumbing problems remain visible. Add head comments documenting
each test's invariant, stimulus, oracle, and why the timeout is a bounded
CI tolerance rather than the behavior under test.
Validation:
- bash -n tests/mysql-actq-mt.sh tests/mysql-actq-mt-withpause.sh
- git diff --check
With the help of AI-Agents: Codex
Why:
The retry wrapper can inherit a failure marker left by the first
attempt. That marker can block the second attempt when abort-all is
enabled and can leave stale failure state after a successful retry.
Impact:
Retry behavior is now deterministic and does not leak stale testbench
failure state across attempts.
Before/After:
Before: retries could short-circuit or leave false failure artifacts.
After: retries clear marker state before rerun and on success.
Technical Overview:
The wrapper now removes testbench_test_failed_rsyslog when a retry
attempt succeeds.
It also removes the same marker after a failed attempt when another
attempt remains.
This preserves the intended retry flow while keeping final-failure
reporting unchanged on the last attempt.
The change is scoped to the test wrapper and does not modify runtime
code paths.
With the help of AI-Agents: GPT-5.3-Codex
Why:
omrelp keepalive settings were accepted but could be compiled out
because configure never defined HAVE_RELPCLTSETKEEPALIVE.
Impact: Keepalive settings are now applied when librelp exports
the keepalive API.
Before/After: Before, keepalive options could be silently ignored;
after, they are compiled in when librelp supports them.
Technical Overview:
- Add an AC_CHECK_FUNC probe for relpCltSetKeepAlive in the RELP
configure block.
- Define HAVE_RELPCLTSETKEEPALIVE when the symbol is available.
- This aligns configure-time feature detection with the existing
omrelp compile-time guard around relpCltSetKeepAlive().
With the help of AI-Agents: GPT-5.3-Codex
Why:
The de-flake campaign exposed a real imptcp race in the
processOnPoller="off" path under Ubuntu 26 TSAN. Multiple helper
workers could process one session concurrently and race on parser state.
Impact:
Fixes imptcp helper-worker session handling without reducing test scope.
Before/After:
Before, helper workers could race on one session; after, one worker owns
session processing, close, and rearm at a time.
Technical Overview:
Add a per-session queued-work flag protected by rsyslog's atomic helper.
Claim session epoll work before queueing it to helper workers.
Serialize receive parsing, zlib finish, session close, and epoll rearm.
Drop duplicate same-session events while already queued or processing.
Release the work claim before rearming the EPOLLONESHOT descriptor so a
fresh event cannot be lost behind the processing guard.
Avoid holding a pthread mutex across recv(), which would both hurt the
hot path and trip the clang static analyzer's blocking-in-critical-section
check.
Keep listener work concurrent and preserve helper parallelism across
independent sessions.
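The claim, drop-duplicate, and release-before-rearm sequence above can be sketched as follows. This is an illustration only: it uses C11 <stdatomic.h> in place of rsyslog's own atomic helper, and struct session, work_queued, and the function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical session object; the real imptcp structure differs. */
struct session {
    atomic_bool work_queued; /* true while queued to or processed by a worker */
    int events_handled;      /* stand-in for receive and parse work */
};

static void session_init(struct session *s) {
    atomic_init(&s->work_queued, false);
    s->events_handled = 0;
}

/* Returns true if the caller now owns this session's work. */
static bool session_try_claim(struct session *s) {
    bool expected = false;
    return atomic_compare_exchange_strong(&s->work_queued, &expected, true);
}

static void session_release(struct session *s) {
    atomic_store(&s->work_queued, false);
}

/* Epoll side: claim before queueing; a duplicate event for a session that
 * is already queued or being processed is dropped. */
static bool dispatch_event(struct session *s) {
    if (!session_try_claim(s))
        return false; /* duplicate same-session event */
    /* ...enqueue s to a helper worker here... */
    return true;
}

/* Helper-worker side: exactly one worker runs this per claim. No mutex is
 * held while the (simulated) blocking receive happens. */
static void process_session(struct session *s) {
    s->events_handled++;  /* recv(), parsing, zlib finish would go here */
    session_release(s);   /* release the claim first... */
    /* ...then rearm the EPOLLONESHOT descriptor, so an event that fires
     * right after the rearm is not rejected by a stale claim. */
}
```

Because the flag is per session, independent sessions never contend with each other, which keeps helper parallelism across sessions intact.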
Document the non-processing-poller test intent and oracle.
With the help of AI-Agents: Codex
Why:
Regular PR CI should avoid waking long-running service-backed tests when a
change only touches unrelated helper code. Kafka, imfile, and Elasticsearch
are frequent long-tail costs, so they need focused relevance gates without
weakening full CI and flake-testing workflows.
Impact:
PR CI omits Kafka, imfile, and Elasticsearch tests for unrelated helper-only
changes, while direct module/test changes and plausible shared runtime paths
still run those families. Local CI-container runs can apply the same
relevance policy before devtools/run-ci.sh.
Before/After:
Before, broad runtime patterns made these expensive families run too often;
after, they use explicit focused dependency rules with full-run overrides.
Technical Overview:
Move the remaining root-level runtime C/H files under runtime/ so path-based
rules can reason about core code consistently. Keep conservative broad
relevance for service families that do not yet have focused dependency
rules. Add focused relevance for Kafka, imfile, and Elasticsearch covering
module paths, tests, build/testbench plumbing, config/message/action/queue,
worker, template, ruleset, parser, stats, and selected family-specific
runtime helpers. Keep isolated helpers such as lookup tables, dynstats, DNS
cache, crypto/KSI, GSSAPI, and unrelated protocol helpers from waking those
families. Add devtools/apply-service-relevance.sh so GitHub Actions and local
container testing share the same relevance-to-configure suppression logic.
Centralize Elasticsearch and Kafka job decisions on the top-level
change-scope outputs so scheduled jobs always run their test body. Preserve
RSYSLOG_TESTBENCH_FORCE_SERVICE_TESTS,
RSYSLOG_TESTBENCH_FORCE_<MODULE>_TESTS, and
RSYSLOG_TESTBENCH_SKIP_SERVICE_RELEVANCE so daily, weekly, and flake runs
can still force all tests even when there are no relevant changes. Document
that AI agents must validate both the relevance decision layer and the
resulting configured test list when changing these gates.
Validation:
bash -n tests/diag.sh devtools/apply-service-relevance.sh
git diff --check
actionlint .github/workflows/run_checks.yml
shellcheck -S warning devtools/apply-service-relevance.sh
module-needs-testing rule matrix for kafka, imfile, elasticsearch, mysql
Temporary git-diff probes for runtime/lookup.c and runtime/action.c
Source helper checks for runtime/lookup.c and runtime/action.c
Ubuntu 26.04 container make distclean plus MOCK-OK run-ci for runtime/lookup.c
With the help of AI-Agents: Codex
Why:
The de-flake campaign exposed a get_free_port race in the OTEL
collector test helper. A parallel test could claim the selected port
before otelcol bound it, while readiness checks still connected to the
wrong service.
Impact:
Makes OTEL-backed tests publish only collector-owned listener ports.
Before/After:
Before, OTEL tests preselected a racy port; after, otelcol binds
localhost port 0 and the testbench discovers the owned OTLP listener.
Technical Overview:
Configure the OTEL collector receiver and metrics endpoint with
localhost dynamic ports by default.
Start otelcol with exec so the stored PID owns the listener sockets.
Discover the actual OTLP HTTP port from /proc socket ownership and a
/v1/logs probe.
Write the test port file only after discovery and readiness succeed.
Keep explicit nonzero OTEL_COLLECTOR_ENDPOINT overrides working.
Move the discovery logic into an in-tree Python helper so normal Python
linting can inspect it.
Register the helper in EXTRA_DIST.
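Why binding port 0 removes the race can be illustrated with a short sketch. This is not the in-tree Python helper (which discovers the collector's port from /proc); it only shows the underlying socket behavior: the kernel picks a free port at bind time, and the owner reads it back with getsockname(). The function name is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP listener to 127.0.0.1 port 0 and report the port the kernel
 * assigned. Returns the listening fd, or -1 on error. */
int listen_on_ephemeral_port(unsigned short *port_out) {
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* 0 = let the kernel choose a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, 8) != 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &alen) != 0) {
        close(fd);
        return -1;
    }
    /* The port is already owned by this socket when we learn its number,
     * so no other process can claim it in between. */
    *port_out = ntohs(addr.sin_port);
    return fd;
}
```

With a preselected port there is a gap between choosing the number and binding it; here the selection and the bind are the same step, so only the discovery of the chosen number remains for the testbench.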
With the help of AI-Agents: Codex
closes https://github.com/rsyslog/rsyslog/issues/6017
Why:
The internal offAfterPRI field tracks the offset in a raw message
immediately after the PRI. It was calculated inconsistently across
modules (e.g. imuxsock omitted the closing '>'), and invalid strings
(e.g. '<>') could be accepted as a valid PRI. This caused
misalignments and potential out-of-bounds risks in downstream
parser modules.
Impact:
Stabilizes syslog parsing; downstream modules consistently receive
accurate raw message text.
Before/After:
Before, offAfterPRI was inconsistently calculated or misaligned on
malformed/special inputs; after, it is centrally validated and correct.
Technical Overview:
Extracted the PRI offset logic into a strict static helper
compute_off_after_pri in runtime/parser.c to parse 1..3 digits
between '<' and '>'. Refactored ParsePRI to use this helper. Enhanced
MsgSetAfterPRIOffs in runtime/msg.c with defensive assertions to
validate offsets and enclosing brackets. Updated the legacy imuxsock
parser to set the correct offs + 1 offset when the closing '>' is
present. Created a pure C unit test checking 10 distinct
RFC3164/RFC5424 corner cases.
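The "1..3 digits between '<' and '>'" rule can be sketched as below. This is an illustration of the stated rule only; the function name, signature, and the choice of 0 as the "no valid PRI" result are assumptions and may differ from compute_off_after_pri in runtime/parser.c.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the offset of the first byte after a well-formed "<digits>" PRI
 * with 1 to 3 digits, or 0 if the message does not start with one.
 * (Hypothetical helper; 0 as the failure value is an assumption.) */
static size_t off_after_pri_sketch(const char *msg, size_t len) {
    size_t i = 1;

    if (len < 3 || msg[0] != '<')
        return 0; /* shortest valid PRI is "<d>" */
    while (i < len && i <= 3 && isdigit((unsigned char)msg[i]))
        ++i;
    if (i == 1)
        return 0; /* no digits, e.g. "<>" */
    if (i >= len || msg[i] != '>')
        return 0; /* more than 3 digits, or no closing '>' */
    return i + 1; /* offset includes the closing '>' */
}
```

For "<13>msg" the result is 4, the index of 'm'; inputs such as "<>", "<1234>", and "<12" are rejected instead of yielding a misaligned offset.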
With the help of AI-Agents: Antigravity
* imfifo: implement named pipe input module
Why:
Allows rsyslog to read logs line-by-line from local POSIX named pipes
(FIFOs) without blocking the startup sequence or spinning on EOF
disconnect loops.
Impact:
Adds the 'imfifo' input module and registers its test suite.
Before/After:
Rsyslog had no native named pipe input capability; now imfifo
provides dynamic, non-blocking FIFO input instances.
Technical Overview:
- Integrated imfifo into the autotools build system with
--enable-imfifo.
- Implemented plugins/imfifo/imfifo.c using the modern v6 config
syntax.
- Used open(path, O_RDWR) to keep a dummy writer, avoiding
startup hangs and EOF reopen loops.
- Implemented a select() polling loop with a 100ms timeout for
clean, quick shutdown.
- Split incoming chunks at newlines, submitting complete
messages using submitMsg2.
- Created tests/imfifo.sh and tests/imfifo-vg.sh to verify
correct function and Valgrind compatibility.
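The three mechanisms in the list above (O_RDWR open, select() with a short timeout, newline splitting) can be sketched as follows. This is a minimal illustration, not the code in plugins/imfifo/imfifo.c; all function names and the callback type are hypothetical. Note that POSIX leaves O_RDWR on a FIFO undefined, while Linux supports it.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical callback type: receives one complete line without '\n'. */
typedef void (*line_cb)(const char *line, size_t len, void *ctx);

/* Example callback for demonstration: counts lines via an int in ctx. */
void count_line(const char *line, size_t len, void *ctx) {
    (void)line;
    (void)len;
    ++*(int *)ctx;
}

/* Opening read-write makes this process its own writer, so open() does
 * not block waiting for a peer and read() does not see EOF when an
 * external writer disconnects. */
int open_fifo_rdwr(const char *path) {
    return open(path, O_RDWR);
}

/* Wait up to timeout_ms for data. Returns 1 if readable, 0 on timeout,
 * -1 on error. A short timeout lets the caller check a shutdown flag. */
int wait_readable(int fd, int timeout_ms) {
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* Pass each complete line in buf[0..len) to cb, then move the trailing
 * partial line to the front of buf. Returns the number of bytes kept. */
size_t emit_complete_lines(char *buf, size_t len, line_cb cb, void *ctx) {
    size_t start = 0;
    size_t i;

    for (i = 0; i < len; ++i) {
        if (buf[i] == '\n') {
            cb(buf + start, i - start, ctx);
            start = i + 1;
        }
    }
    memmove(buf, buf + start, len - start);
    return len - start;
}
```

An input loop would call wait_readable(fd, 100), append what read() returns after the retained bytes, and call emit_complete_lines() again, so a line that arrives in two chunks is submitted once, complete.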
closes https://github.com/rsyslog/rsyslog/issues/440
With the help of AI-Agents: Antigravity
* tests: widen service relevance defaults
Why: Service-backed tests were skipped for broad, non-module edits that
can still affect service integrations.
Impact: Elasticsearch, MySQL/libdbi, and Kafka setup paths run for
shared core, build, workflow, and testbench changes.
Before/After: Before, only runtime and a narrow allow-list triggered
service tests; after, common cross-cutting edits also trigger them.
Technical Overview: Extend the generic module_needs_testing()
changed-file gate in tests/diag.sh.
Treat top-level C/H changes as globally relevant because they include
shared engine files such as action.c/template.c.
Treat build and CI metadata updates (.mk, m4, workflows) as relevant
so service jobs selected by CI do not self-skip prematurely.
Treat testbench shell/testsuites edits as relevant because service
orchestration and service-specific assertions live under tests/.
Keep module-specific path matching unchanged for targeted triggering.
With the help of AI-Agents: GPT-5.3-Codex
Why:
The mmjsonparse find-json ownership fix is already present via PR #7016,
but the conflict-container path still needs explicit regression coverage.
Impact:
Adds focused normal and Valgrind testbench coverage for msgAddJSON failure
after mmjsonparse hands off a parsed JSON object.
Before/After:
Before, the negative path relied on manual reasoning and broad coverage.
After, the testbench asserts rsyslog continues processing the trigger
message, and the Valgrind wrapper checks that the parsed object is not
released twice.
Technical Overview:
1. Add mmjsonparse-find-json-conflict.sh for the conflicting-container path.
2. Add a Valgrind wrapper for the same scenario.
3. Register both tests in tests/Makefile.am.
With the help of AI-Agents: Codex
Why:
da-mainmsg-q is meant to exercise disk-assisted main queue draining,
but its diagnostic injector could overrun the deliberately tiny queue
under CI stress. That made the test report message loss before it had
actually isolated the DA queue behavior it intends to verify.
Impact:
Reduces da-mainmsg-q flakes without weakening the tested DA queue oracle.
Before/After:
Before, imdiag injected a 2000-message burst as non-delayable traffic;
after, the burst participates in queue flow control and the final output
count is observed before shutdown.
Technical Overview:
Set RSTB_IMDIAG_INJECT_DELAY_MODE=full before generate_conf so imdiag
marks generated messages as fully delayable. This keeps the test's small
queue configuration intact while avoiding diagnostic-input loss as a side
effect of the stress setup.
The test still verifies the complete sequence 0..2099 after forcing DA
mode. It now also waits for the final 2100 output lines after the post-DA
recovery burst, so shutdown is not used as a substitute for the omfile
output oracle.
The header comment was updated to document the setup, stimulus, oracle,
and why the injection mode is part of the test plumbing rather than the
behavior under test.
With the help of AI-Agents: OpenAI Codex