Why:
PR 6894 adds the cumulative size.enqueued queue counter, so the omfwd_fast_imuxsock expectation needs to match the extended queue stats line.
Impact:
Fixes the affected test expectation and removes one shellcheck warning.
Before/After:
Before, the test expected enqueued to be followed directly by full. After, it accepts the new size.enqueued field in between.
Technical Overview:
The omfwd queue stats regex now includes the size.enqueued field added by the queue byte accounting change.
The test header documents the queue-stats oracle.
PORT_RCVR assignment is split from export so shellcheck can preserve the get_free_port exit status.
With the help of AI-Agents: Codex
queue: add per-queue size.enqueued counter (cumulative bytes)
Add a new uint64 stats counter ctrSizeEnqueued, exposed by impstats as
"size.enqueued" alongside the existing "enqueued" message counter, so
operators can monitor the byte volume flowing through every named queue
(main / ruleset / action) instead of only the message count.
The counter is incremented in doEnqSingleObj() right after ctrEnqueued,
before any discard or flow-control check, so it represents the inbound
arrival rate. Discarded bytes remain visible via ctrFDscrd / ctrNFDscrd
on the message side.
Resettable, atomic via STATSCOUNTER_ADD. Pure addition: no behavioural
change for existing counters or queue semantics.
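The ordering described above can be sketched as follows. This is a minimal illustration that uses C11 atomics in place of rsyslog's STATSCOUNTER_ADD macro; the names `queue_stats_t` and `enqueue_account` are hypothetical and do not appear in the rsyslog sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-queue statistics block. */
typedef struct {
    atomic_uint_fast64_t enqueued;      /* messages seen at enqueue time */
    atomic_uint_fast64_t size_enqueued; /* cumulative bytes seen at enqueue time */
} queue_stats_t;

/* Account for one arriving message. Both counters are bumped before any
 * discard or flow-control decision would be made, so they reflect the
 * inbound arrival rate rather than what ends up stored in the queue. */
static void enqueue_account(queue_stats_t *st, size_t msg_len) {
    assert(st != NULL);
    atomic_fetch_add_explicit(&st->enqueued, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&st->size_enqueued, (uint_fast64_t)msg_len,
                              memory_order_relaxed);
}
```

Relaxed ordering is assumed to be sufficient here because the counters are only read for statistics output and do not synchronize other data.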
Why:
Disk and disk-assisted queues must not leave stale corrupted queue
segments live after runtime dequeue corruption. The disk-assisted child is
also a real disk queue, so it needs the same corruption semantics as a
queue configured directly as disk.
Impact:
Safe-mode corruption recovery now quarantines unread runtime queue tails
for pure disk and disk-assisted queues and logs visible recovery messages.
Before/After:
Before, some runtime corruptions could fall back to invalid .qi recovery
or leave DA disk child handling inconsistent; after, unread corrupted
tails are quarantined and fresh queue state is constructed.
Technical Overview:
Propagate queue.onCorruption from the DA parent to the disk child.
Apply bounded runtime corruption recovery to DA disk children as well as
pure disk queues.
Keep the existing single-record invalid .qi recovery path separate so the
legacy DA .qi test still exercises that scenario.
Reset disk state after quarantine without double-subtracting imdiag queue
counters, and subtract skipped logical tail records explicitly.
Add pure disk and DA segment-corruption tests that verify quarantine,
user-visible messages, and fresh live queue files only.
Document that DA mode is an in-memory parent plus a first-class disk
child with normal disk queue semantics.
With the help of AI-Agents: Codex
Why:
Older platforms need consistent formatted string allocation, and the
remaining copy helpers kept triggering review noise around classic C
string APIs.
A major motivation is to avoid very common AI review false positives:
those tools often do not understand the actual scope and safety checks,
and then mechanically flag strcpy-style APIs despite the surrounding
bounds and initialization logic being correct.
Impact: string allocation and bounded copy paths are now explicit and
portable across the tree.
Before/After: ad hoc unsafe string helpers remained; now allocation and
bounded copies follow one portable pattern.
Technical Overview:
Add a complete asprintf and vasprintf compatibility layer with shared
prototypes so older libc variants build without local wrappers.
Replace repo-wide strcpy, strcat, strncat, sprintf, and direct strncpy
uses with explicit memcpy-based bounded copies or exact-width byte
copies as appropriate for each destination.
Add rsCStrAppendParts() for incremental string assembly so callers can
build pre-sized buffers without repeated snprintf return handling.
Update the unicode helper copy routine so existing ustrncpy() call sites
no longer route to libc strncpy semantics.
This also removes a broad class of review distractions from automated AI
reviewers that key off banned function names without understanding the
actual copy contract at the call site.
Extend the stringbuf unit coverage for the new append helper and the
formatted-allocation compatibility path.
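A formatted-allocation fallback of the kind described above is commonly built from two vsnprintf passes. The sketch below shows that general technique only; the names `compat_vasprintf` and `compat_asprintf` are hypothetical, and it assumes a C99-conforming vsnprintf that returns the required length when given a NULL buffer. It is not the compatibility layer from this commit.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fallback: measure first, then allocate and format.
 * Returns the number of bytes written (excluding the NUL) or -1 on error,
 * in which case *out is set to NULL. */
static int compat_vasprintf(char **out, const char *fmt, va_list ap) {
    va_list ap2;
    int need, written;

    assert(out != NULL && fmt != NULL);
    *out = NULL;

    va_copy(ap2, ap); /* the first pass consumes a copy of the arguments */
    need = vsnprintf(NULL, 0, fmt, ap2);
    va_end(ap2);
    if (need < 0) return -1;

    *out = malloc((size_t)need + 1);
    if (*out == NULL) return -1;

    written = vsnprintf(*out, (size_t)need + 1, fmt, ap);
    if (written < 0 || written > need) {
        free(*out);
        *out = NULL;
        return -1;
    }
    return written;
}

/* Variadic convenience wrapper around compat_vasprintf(). */
static int compat_asprintf(char **out, const char *fmt, ...) {
    va_list ap;
    int r;
    va_start(ap, fmt);
    r = compat_vasprintf(out, fmt, ap);
    va_end(ap);
    return r;
}
```

The caller owns the returned buffer and releases it with free().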
With the help of AI-Agents: Codex
Corrupted disk queue recovery can leave a queue logically empty after
startup while late internal messages still land in the active write
stream during shutdown. The disk queue destructor only deleted the
write file when its offset was zero, which left orphaned mainq.* files
behind after recovery.
Treat any active write file as disposable once the disk queue is
logically empty, and add a startup reproducer that corrupts bytes inside
a persisted disk-queue segment before restart.
Closes https://github.com/rsyslog/rsyslog/issues/5085
Why
The queue security audit found several reachable validation and
recovery bugs in queue configuration, worker lifecycle handling,
and disk queue housekeeping. These failures were spread across the
v6 object parser, legacy sysline parser, and runtime recovery paths.
Impact
Queue configs now reject empty spool directories and non-positive
worker thread counts, and worker/disk queue recovery paths fail
closed instead of proceeding with inconsistent state.
Before/After
Before, invalid queue settings and worker startup/join failures could
slip through or corrupt recovery bookkeeping; after, those paths are
validated and handled deterministically with regression coverage.
Technical Overview
Switch queue worker thread settings to positive-integer handling in
both object config and legacy sysline plumbing.
Reject empty queue.spooldirectory values before trailing-slash
normalization and propagate cryprov initialization failures.
Keep the to-delete list sorted correctly and clear the duplicate-file
registry root after config teardown.
Avoid counting uninitialized stat data during multi-file queue seek and
reject invalid stream iMaxFiles values.
Handle pthread_create and pthread_join failures without leaving worker
state inconsistent, and add overflow guards to transactional worker
parameter growth.
Register focused regression tests for empty spool directories and zero
worker-thread configs and keep them in the distributed test set.
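The two configuration checks named above (non-positive worker thread counts, empty spool directories) can be illustrated with a small validator. This is a sketch under stated assumptions: `validate_queue_conf` and the `queue_conf_status_t` codes are hypothetical, and treating a NULL spool directory as "not configured" is an assumption, not something the commit states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for the sketch. */
typedef enum {
    QUEUE_CONF_OK = 0,
    QUEUE_CONF_BAD_WORKERS,  /* worker thread count is zero or negative */
    QUEUE_CONF_EMPTY_SPOOLDIR /* spool directory was given but is empty */
} queue_conf_status_t;

/* Reject invalid settings up front, before any normalization such as
 * trailing-slash stripping would touch the spool directory string.
 * Assumption: spool_dir == NULL means the parameter was not set. */
static queue_conf_status_t validate_queue_conf(const char *spool_dir,
                                               int worker_threads) {
    if (worker_threads <= 0) return QUEUE_CONF_BAD_WORKERS;
    if (spool_dir != NULL && spool_dir[0] == '\0') return QUEUE_CONF_EMPTY_SPOOLDIR;
    return QUEUE_CONF_OK;
}
```

Checking the empty string before normalization matters because stripping a trailing slash from an empty value would otherwise index before the start of the buffer.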
With the help of AI-Agents: Codex
This improves code readability and maintainability by correcting
typos in comments. While non-functional, clear comments reduce
confusion for contributors and support long-term project quality.
Before: comments contained minor typos and inconsistencies.
After: comments use corrected spelling and clearer wording.
Impact: none (no functional or behavioral changes).
The patch updates comment text only, without modifying logic,
interfaces, or runtime behavior. No changes to queue semantics,
transactions, or module interactions are introduced.
This aligns with ongoing maintenance efforts to keep the codebase
clean and easier to understand for contributors and reviewers.
Fixes: https://github.com/rsyslog/rsyslog/issues/6023
AI-Agent: Copilot 2026-03
Why:
This branch combines two related hardening steps for disk queue reliability:
- robust corruption detection/recovery handling in disk queue state/file validation
- worker startup cancellation-race closure that could lead to shutdown wait loops
Impact:
- disk queue scan now rejects out-of-range segment sequence numbers early and
reports corruption deterministically.
- worker startup no longer exposes a cancellation window before cleanup
registration.
- test/CI diagnostics preserve timeout backtraces (gdb) in ARM jobs and print
them to stdout for post-mortem debugging.
- test script cleanup removes redundant operations and uses a macOS-friendlier
segment enumeration path.
Technical Overview:
- runtime/queue.c:
  - add out-of-range sequence-number rejection during spool scan
  - keep orphan-loop range check as defensive fallback
- runtime/wtp.c:
  - disable cancellation and register cleanup before publishing RUNNING
  - document startup/cancellation invariant inline
- runtime/wti.c:
  - add concise cancellation-contract comment
- devtools/ci/Dockerfile.arm:
  - install gdb for CI timeout diagnostics
- tests/diskqueue-oncorruption-missing-segment.sh:
  - emit timeout gdb backtraces to stdout
  - drop redundant STARTED_LOG truncate
  - avoid GNU find -printf/mapfile dependency in segment listing
Revert the problematic condition added in commit 4748c5746 that
activated the DA worker pool when disk queue (pqDA) has data.
Root Cause:
The DA worker pool (pWtpDA, ConsumerDA function) moves data FROM
the in-memory parent queue TO the disk queue. When activated with
an empty parent queue, it immediately terminates (parent below low
watermark), but the condition remains true, causing an infinite
start/stop loop.
Why the original logic was incorrect:
The commit misunderstood the queue architecture. It tried to solve
slow disk queue draining by activating the DA worker pool, but:
- DA worker pool: Moves memory → disk (for spillover)
- Disk queue workers: Process disk → actions (automatic on load)
When rsyslog restarts with persisted disk queue data:
1. pqDA (disk queue) is loaded from files
2. pqDA's own regular workers start automatically via qqueueStart()
3. Those workers process messages from disk
4. No DA worker pool activation needed!
Test Results:
- With buggy code: 372 DA worker starts, test unstable
- With revert: 2 DA worker starts (normal), 19/20 test runs pass
- The 1/20 failure is pre-existing test flakiness
The original issue #2646 likely had a different root cause that
needs separate investigation. This revert prevents the regression
while restoring system stability.
Fixes regression in test: daqueue-drain-without-traffic.sh
Relates to: issue #2646, commit 4748c5746
Why:
Disk-assisted queues were taking days to drain after recovery
because the DA worker only activated when the in-memory queue
reached the high watermark, creating a catch-22 when starting
with an empty memory queue but full disk queue.
Impact:
This fix enables proper recovery from backlogs and prevents data
loss from queues that cannot drain. Existing behavior for normal
operations is preserved.
Before:
DA worker only started when: memQueueSize >= highWatermark
After:
DA worker starts when: memQueueSize >= highWatermark OR
diskQueueSize > 0
Technical Overview:
Modified qqueueAdviseMaxWorkers() in runtime/queue.c to check
both the memory queue size against the high watermark (original
condition) and whether the disk queue (pqDA) has pending messages.
This ensures the DA worker activates whenever there is data on
disk to process, not just when new incoming traffic fills the
memory queue. The NULL check for pqDA prevents dereferencing
before the DA queue is initialized. This change maintains the
original high-watermark behavior while adding the recovery path.
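The activation condition this commit message describes can be written as a small predicate. This is only a sketch of the stated logic; the types and names (`da_queue_t`, `disk_queue_t`, `da_worker_should_start`) are hypothetical and the real qqueueAdviseMaxWorkers() code is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the disk child queue. */
typedef struct {
    int size; /* number of messages pending on disk */
} disk_queue_t;

/* Hypothetical, reduced view of the disk-assisted parent queue. */
typedef struct {
    int mem_size;           /* messages currently held in memory */
    int high_watermark;     /* spill-over threshold */
    disk_queue_t *da_child; /* NULL until the DA queue is initialized */
} da_queue_t;

/* Start the DA worker when the memory queue reaches the high watermark
 * (original condition) OR when the disk child already holds messages.
 * The NULL check guards against use before the child exists. */
static bool da_worker_should_start(const da_queue_t *q) {
    assert(q != NULL);
    if (q->mem_size >= q->high_watermark) return true;
    return q->da_child != NULL && q->da_child->size > 0;
}
```

Note that the second clause stays true for as long as the disk child holds data, regardless of the state of the memory queue.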
closes https://github.com/rsyslog/rsyslog/issues/2646
With the help of AI-Agents: GitHub Copilot
Hardens disk-queue recovery after an invalid .qi so read/write pointers
realign and on-disk size is corrected. This prevents stuck queues and
stabilizes the daqueue dirty shutdown test.
Bug Fixes
- On anomaly (rd==wr and offsets equal), seek the read-delete cursor to
the writer, subtract deleted bytes from sizeOnDisk, and align the
read-dequeue cursor; keep draining if seek fails.
- Log errors when pointer resets or seeks fail.
- Add strm.Sync() to keep stream state consistent after pointer updates.
- Refactor invalid .qi recovery and startup seek errors into helpers.
- When spool read files are missing on startup, align read to write and
continue recovery.
With the help of AI-Agents: gpt-5.2-codex
Replace opaque/variadic callback usage with explicit, type-safe function
signatures to reduce undefined behavior and clarify intent.
Adapter helpers bridge the existing APIs without raw variadic casts, enabling
the transition incrementally. Callback setup sites are standardized for
consistent readability. This tightens the contract on callbacks, eases future
refactoring, and makes their roles more self-documenting.
Inspired by https://github.com/rsyslog/rsyslog/pull/5882
With AI support: Codex, Gemini
This commit applies the new canonical formatting style using `clang-format` with custom settings (notably 4-space indentation), as part of our shift toward automated formatting normalization.
⚠️ No functional changes are included — only whitespace and layout modifications as produced by `clang-format`.
This change is part of the formatting modernization strategy discussed in:
https://github.com/rsyslog/rsyslog/issues/5747
Key context:
- Formatting is now treated as a disposable view, normalized via tooling.
- The `.clang-format` file defines the canonical style.
- A fixup script (`devtools/format-code.sh`) handles remaining edge cases.
- Formatting commits are added to `.git-blame-ignore-revs` to reduce noise.
- Developers remain free to format code however they prefer locally.
* Fix issue with queue.maxDiskSpace validation
When queue.maxDiskSpace is set smaller than queue.maxfilesize, rsyslog
could enter an infinite loop during shutdown. This fix adds validation
to ensure maxDiskSpace is at least as large as maxFileSize.
If an invalid configuration is detected, the system will log a warning
and automatically adjust maxDiskSpace to match maxFileSize to prevent
the shutdown loop.
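The adjustment described above amounts to a clamp with a warning. A minimal sketch follows, with a hypothetical `disk_limits_t` struct and `normalize_disk_limits` helper; treating a maxDiskSpace of 0 as "no limit" is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the two settings involved. */
typedef struct {
    int64_t max_file_size;  /* queue.maxFileSize in bytes */
    int64_t max_disk_space; /* queue.maxDiskSpace in bytes; assumption: 0 = no limit */
} disk_limits_t;

/* Ensure max_disk_space is at least max_file_size.
 * Returns 1 if the value was adjusted, 0 if it was already valid. */
static int normalize_disk_limits(disk_limits_t *lim) {
    assert(lim != NULL);
    if (lim->max_disk_space > 0 && lim->max_disk_space < lim->max_file_size) {
        fprintf(stderr,
                "warning: queue.maxDiskSpace (%lld) is smaller than "
                "queue.maxFileSize (%lld), adjusting to %lld\n",
                (long long)lim->max_disk_space, (long long)lim->max_file_size,
                (long long)lim->max_file_size);
        lim->max_disk_space = lim->max_file_size;
        return 1;
    }
    return 0;
}
```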
closes: https://github.com/rsyslog/rsyslog/issues/2693
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
This commit performs a broad modernization of widely used rsyslog
macros to align with modern C practices and support automated
formatting tools like clang-format. The changes focus on improving
syntactic regularity, readability, and tooling compatibility — without
altering behavior.
Macros refactored in this commit now follow a consistent,
statement-like form with explicit trailing semicolons. Where
applicable, macro blocks that define module interfaces (`queryEtryPt`)
have been updated to use simple `if` statements instead of `else if`
chains. While this slightly increases evaluation time, the affected
functions are only called once per module during load time to register
supported interfaces — making the performance cost irrelevant in
practice.
These improvements serve multiple purposes:
- Enable reliable clang-format usage without mangling macro logic
- Simplify reasoning about macro-expanded code for human readers
- Reduce style drift and merge conflicts
- Facilitate development for contributors using assistive tools
- Support future formatting pipelines using:
1. `clang-format`
2. a post-fixup normalization script
Refactored macros:
- MODULE_TYPE_NOKEEP
- MODULE_TYPE_KEEP
- MODULE_TYPE_INPUT
- MODULE_TYPE_OUTPUT
- MODULE_TYPE_FUNCTION
- MODULE_TYPE_PARSER
- MODULE_TYPE_LIB
- DEF_IMOD_STATIC_DATA
- DEF_OMOD_STATIC_DATA
- DEF_PMOD_STATIC_DATA
- DEF_FMOD_STATIC_DATA
- DEFobjStaticHelpers
- SIMP_PROP(...)
And all `queryEtryPt()` dispatch macros:
- CODEqueryEtryPt_STD_MOD_QUERIES
- CODEqueryEtryPt_STD_OMOD_QUERIES
- CODEqueryEtryPt_STD_OMODTX_QUERIES
- CODEqueryEtryPt_STD_OMOD8_QUERIES
- CODEqueryEtryPt_TXIF_OMOD_QUERIES
- CODEqueryEtryPt_IsCompatibleWithFeature_IF_OMOD_QUERIES
- CODEqueryEtryPt_STD_IMOD_QUERIES
- CODEqueryEtryPt_STD_CONF2_QUERIES
- CODEqueryEtryPt_STD_CONF2_setModCnf_QUERIES
- CODEqueryEtryPt_STD_CONF2_OMOD_QUERIES
- CODEqueryEtryPt_STD_CONF2_IMOD_QUERIES
- CODEqueryEtryPt_STD_CONF2_PREPRIVDROP_QUERIES
- CODEqueryEtryPt_STD_CONF2_CNFNAME_QUERIES
- CODEqueryEtryPt_STD_PMOD_QUERIES
- CODEqueryEtryPt_STD_PMOD2_QUERIES
- CODEqueryEtryPt_STD_FMOD_QUERIES
- CODEqueryEtryPt_STD_SMOD_QUERIES
- CODEqueryEtryPt_doHUPWrkr
- CODEqueryEtryPt_doHUP
This general modernization reduces macro misuse, improves DX, and
lays the foundation for a robust, automated style normalization
system.
See also: https://github.com/rsyslog/rsyslog/issues/5747
Among other things, remove some warning suppressions by "fixing" the
respective constructs with workarounds (the root cause is that compilers
do not handle enums in switch statements well).
Note: The upcoming gnu23 C standard is overdoing it with type-safety. Inside
rsyslog, we historically have method tables for generic calls, which
keeps the code small and easy to understand. This would not decently be
possible with the new type-safety requirements.
So this commit works around these warnings in a way that pretends to
provide more type safety. We have done this in the least intrusive
way to reduce the risk of regressions in code that has worked well
for decades. Also note that the code already does parameter
validation.
There would have been more elaborate ways to make the gnu23 compiler happy,
e.g. by using a union of structs to provide the data element. Some folks
consider this type safe. In reality, it is not a bit better than
traditional C without types at all, because the caller still needs to
ensure it picks the right struct from the union. As this approach
would also have larger regression potential, we have not used it.
Right now, we have suppressed some of the new warnings, as working
around them would have required an even larger time budget and
potentially larger regression potential. In the long term we may
want to look into enabling them, as they would potentially be
beneficial for new code not involving method tables.
Some nits, however, were detected and have been fixed.
This patch also "fixes" some false positive test failures, mostly
by disabling some test functionality after confirmation that these are
flakes.
see also https://github.com/rsyslog/rsyslog/issues/5507
When switching to Disk queue emergency mode, we destructed the in-memory
queue object. Practice has shown that this MAY cause races during
destruction which themselves can lead to a segfault. For that reason, we
now keep the disk queue object. This will keep some resources,
including disk space, allocated. But we prefer that over a segfault.
After all, it only happens after a serious queue error when we are
already at the edge of hard problems.
see also: https://github.com/rsyslog/rsyslog/issues/4963
Add NULL value handling for pDeqRoot. Without it, segfaults occurred if
messages were discarded during dequeue.
Also fix iOverallQueueSize calculation (discarded items) in imdiag.
While building a testcase for issue #4437, I discovered that the
iOverallQueueSize counter was not subtracting discarded messages. This caused
the testcase to fail with a testcase timeout at the count of the "discardMark"
queue setting.
closes: https://github.com/rsyslog/rsyslog/issues/4437
Direct queues do not apply queue parameters because they are not
actually a physical queue. As such, any parameter set is ignored. This can
lead to unintended results.
The new code detects this case and warns the user.
closes https://github.com/rsyslog/rsyslog/issues/77
This is a fine-tuning option which controls whether or not
rsyslog shall always take the flow control setting from the message. If
so, non-primary queues may also block when reaching the high water mark.
This permits to add some synchronous processing to rsyslog core engine.
However, it is dangerous, as improper use may make the core engine
stall. As such, enabling this option requires very careful planning
of the rsyslog configuration and deep understanding of the consequences.
Note that the option is applied to individual queues, so a configuration
with a large number of queues can (and must, if the option is used) be
fine-tuned to the exact use case.
The rsyslog team strongly recommends leaving the option turned off,
which is the default setting.
see also https://github.com/rsyslog/rsyslog/issues/3941
This was added for a specific debug effort and was obviously never
cleaned up. This issue is not included in any scheduled release. It
was added a few days ago (we did not try to hunt down the exact
commit that caused it).
Thanks to github user eshadesu for alerting us.
closes https://github.com/rsyslog/rsyslog/issues/3955
This was a long-standing bug where the DA queue always had a fixed small batch
size because the setting was not propagated from the memory queue. This also
removes a needless and counter-productive "debug aid" which seems to have been
in the code for quite a while. It did not cause harm because of the batch
size issue.
The action name is stored in modified form for the debug header and
some messages. If it is extremely long, a buffer can be overrun,
resulting in misaddressing and a potential segfault in rsyslog. This
can also happen if the action is NOT named, but a custom path to
the output module is given and that path is very long. This triggers
the same issue because by default the module load path is included
in the action name.
This patch corrects the problem and truncates overly long names
when they are used for name generation.
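Truncating a name into a fixed-size buffer, as described above, can be sketched with a length-clamped memcpy. The helper name `copy_name_truncated` is hypothetical and the buffer size used by rsyslog is not stated in the commit, so the caller supplies it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, which has room for dst_size bytes including the NUL.
 * Names that do not fit are cut off instead of overrunning the buffer.
 * Returns 1 if the name was truncated, 0 if it fit completely. */
static int copy_name_truncated(char *dst, size_t dst_size, const char *src) {
    size_t len;
    int truncated = 0;

    assert(dst != NULL && src != NULL && dst_size > 0);
    len = strlen(src);
    if (len >= dst_size) {
        len = dst_size - 1; /* leave room for the terminating NUL */
        truncated = 1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return truncated;
}
```

The return value lets a caller decide whether to log that a name was shortened.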
The problem was detected during testbench work. We never received
a bug report from practice.
The queue subsystem now provides additional information messages which
may help a regular user to maintain system health. Most importantly,
DA queues now report when they persist queue data at end of run and
when they restart the queue based on persisted data.
If the same name is specified for multiple queues, the queue files
will become corrupted. This commit adds a check during config parsing.
If duplicate names are detected the config parser errors out and the
related object is not created.
Note: this may look like a change of behaviour to some users. However,
this never worked and it was pure luck that these users did not run
into big problems (e.g. DA queues were never going to disk at the
same time). So it is acceptable to error out in this hard error case.
closes https://github.com/rsyslog/rsyslog/issues/1385
New semantic: if lightDelayMark is 0, it is set to the max queue
size, effectively disabling the "light delay" functionality.
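The new semantic can be expressed as a one-line normalization. The function name `effective_light_delay_mark` is hypothetical and only illustrates the rule stated above.

```c
#include <assert.h>

/* A lightDelayMark of 0 is replaced by the maximum queue size. Since the
 * queue can never grow past its maximum, the mark is then never exceeded
 * and "light delay" is effectively disabled. */
static int effective_light_delay_mark(int light_delay_mark, int max_queue_size) {
    assert(max_queue_size > 0);
    return light_delay_mark == 0 ? max_queue_size : light_delay_mark;
}
```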
Thanks to Yury Bushmelev for mentioning issues related to the light
delay mark and proposing the solution (which actually is what
this commit does).
closes https://github.com/rsyslog/rsyslog/issues/1778
We see error reports from users who have configured excessively large queues
and receive an OOM condition or other problems.
With this patch we generate a warning message if a queue is configured very
large. "Very large" is defined to be in excess of 500000 messages.
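The threshold check can be sketched as follows. The helper `warn_if_very_large`, the macro name, and the message wording are hypothetical; only the 500000-message limit comes from the text above.

```c
#include <assert.h>
#include <stdio.h>

/* Threshold taken from the commit text: more than 500000 messages. */
#define VERY_LARGE_QUEUE_SIZE 500000L

/* Format a warning into buf when the configured size exceeds the threshold.
 * Returns 1 if a warning was produced, 0 otherwise (buf is then empty). */
static int warn_if_very_large(const char *queue_name, long max_queue_size,
                              char *buf, size_t buf_size) {
    assert(queue_name != NULL && buf != NULL && buf_size > 0);
    buf[0] = '\0';
    if (max_queue_size <= VERY_LARGE_QUEUE_SIZE) return 0;
    snprintf(buf, buf_size,
             "warning: queue \"%s\" is configured with a size of %ld messages, "
             "which is very large", queue_name, max_queue_size);
    return 1;
}
```

A size of exactly 500000 does not trigger the warning, since the text defines "very large" as in excess of that value.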
see also https://github.com/rsyslog/rsyslog/issues/3314
closes https://github.com/rsyslog/rsyslog/issues/3334
While this is useful for users as well, we have done it so
that we can handle slow CI systems during CI runs. It is also
required for massively parallel testing, which makes each
individual test rather slow.
With the new settings, the testbench framework can now set
longer timeouts by default. The framework was updated accordingly.
The gtls and ossl drivers used a default buffer size of 8 x 1024 bytes to store
received TLS packets. When a TLS read returned more than the buffer size, the
additional data was not processed until new data arrived on the socket again.
TLS RFCs require a buffer of up to 16KB for a single TLS record.
closes https://github.com/rsyslog/rsyslog/issues/3325