<html>
<head>
<title>Message parsers in rsyslog</title>
</head>
<body>
<a href="manual.html">rsyslog documentation</a>
<h1>Message parsers in rsyslog</h1>
<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>
(2009-11-06)</i></small></p>
<h2>Intro</h2>
<p>Message parsers are a feature of rsyslog 5.3.4 and above. In this article, I describe what
message parsers are, what they can do and how they relate to the relevant standards. I will
also describe what you can not do with them. Finally, I give some advice on implementing your
own custom parser.
<h2>What are message parsers?</h2>
<p>Well, the quick answer is that message parsers are the component of rsyslog that
parses the syslog message after it has been received. Prior to rsyslog 5.3.4, message parsers
were built into the rsyslog core itself and could not be modified (other than by modifying
the rsyslog code).
<p>In 5.3.4, we changed that: message parsers are now loadable modules (just
like input and output modules). That means that new message parsers can be added without
modifying the rsyslog core, even without contributing something back to the
project.
<p>But that doesn't answer what a message parser really is. What does it mean to "parse a
message" and, maybe more importantly, what is a message? To answer these questions correctly,
we need to dig down into the relevant standards.
<a href="http://tools.ietf.org/html/rfc5424">RFC5424</a> specifies a layered architecture
for the syslog protocol:
<p align="center"><img src="rfc5424layers.png" alt="RFC5424 syslog protocol layers">
<p>What is important for us is the distinction between the syslog transport and the upper layers.
The transport layer specifies how a stream of messages is assembled at the sender side and
how this stream of messages is disassembled into the individual messages at the receiver
side. In networking terminology, this is called "framing". The core idea is that
each message is put into a so-called "frame", which is then transmitted over the communications
link.
<p>The framing used depends on the protocol. For example, in UDP the "frame"-equivalent is
the packet that is being sent (this also means that no two messages can travel within a single
UDP packet). In "plain tcp syslog", the industry standard, LF is used as a frame delimiter
(which also means that no multi-line message can properly be transmitted, a "design" flaw
in plain tcp syslog). In <a href="http://tools.ietf.org/html/rfc5425">RFC5425</a> there is
a header in front of each frame that contains the size of the message. With this framing,
any message content can properly be transferred.
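<p>To make the size-header idea concrete, here is a minimal sketch of length-prefixed framing
in Python. It is an illustration only, not rsyslog code: the function names are made up for
this example, and the header is assumed to be the decimal message length followed by a single
space. Because the receiver counts octets instead of looking for a delimiter, an LF inside the
message survives the transfer.

```python
def frame_octet_counted(msg: bytes) -> bytes:
    """Prefix the message with its length in octets and a space."""
    return str(len(msg)).encode("ascii") + b" " + msg


def unframe_octet_counted(stream: bytes) -> list[bytes]:
    """Split a byte stream of length-prefixed frames into messages."""
    messages = []
    pos = 0
    while pos < len(stream):
        space = stream.index(b" ", pos)   # end of the length field
        length = int(stream[pos:space])   # number of octets in the message
        start = space + 1
        messages.append(stream[start:start + length])
        pos = start + length              # next frame begins right here
    return messages


# Two messages on one stream; the first one contains an LF.
wire = frame_octet_counted(b"A\nB") + frame_octet_counted(b"C")
print(unframe_octet_counted(wire))  # [b'A\nB', b'C']
```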
<p>And now comes the important part: <b>message parsers do NOT operate at the transport
layer</b>, they operate, as their name implies, on messages. So we can not use message
parsers to change the underlying framing. For example, if a sender splits (for whatever
reason) a single message into two and encapsulates these into two frames, there is no way
a message parser could undo that.
<p>A typical example may be a multi-line message: let's assume some originator has generated
a message of the format "A\nB" (where \n means LF). If that message is being transmitted
via plain tcp syslog, the frame delimiter is LF. So the sender will delimit the frame with
LF, but otherwise send the message unmodified onto the wire (because that is how things are
-unfortunately- done in plain tcp syslog...). So the wire will see "A\nB\n". When this
arrives at the receiver, the transport layer will undo the framing. When it sees the LF
after A, it thinks it has found a valid frame delimiter (in fact, this is the correct view!). So
the receiver will extract one complete message A and one complete message B, not knowing
that they once were both part of a larger multi-line message. These two messages are then
passed to the upper layers, where the message parsers receive them and extract information.
However, the message parsers never know (or even have a chance to see) that A and B
belonged together. Even further, in rsyslog there is no guarantee that A will be parsed
before B - concurrent operations may cause the reverse order (and do so very validly).
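<p>The "A\nB" scenario can be replayed in a few lines of Python. This is only a sketch of
LF-delimited framing as described above, with function names invented for the example. It is
not how any rsyslog input module is actually implemented.

```python
def frame_lf(msg: bytes) -> bytes:
    """Sender side: append LF as the frame delimiter, message unmodified."""
    return msg + b"\n"


def unframe_lf(stream: bytes) -> list[bytes]:
    """Receiver side: every LF ends a frame (empty frames are dropped here)."""
    return [frame for frame in stream.split(b"\n") if frame]


wire = frame_lf(b"A\nB")     # one multi-line message goes in
print(wire)                  # b'A\nB\n'
print(unframe_lf(wire))      # [b'A', b'B'] - two messages come out
```

<p>By the time a message parser is called, it is handed either b'A' or b'B'. Nothing
in its input tells it that the two were once a single message.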
<p>The important lesson is: <b>message parsers can not be used to fix a broken framing</b>.
You need a full protocol implementation to do that, which is the domain of input and
output modules.
<p>I have now told you what you can not do with message parsers. But what are they good for?
Thankfully, broken framing is not the primary problem of the syslog world. A wealth of different
formats is. Unfortunately, many real-world implementations violate the relevant standards
in one way or another. That often makes it very hard to extract meaningful information from
a message or to process messages from different sources by the same rules. In my article
<a href="syslog_parsing.html">syslog parsing in rsyslog</a> I have elaborated on all
the real-world evil that you can usually see. So I won't repeat that here. But in short, the
real problem is not the framing, but how to make malformed messages well-formed.
<p><b>This is what message parsers permit you to do: take a (well-known) malformed message, parse
it according to its semantics and generate perfectly valid internal message representations
from it.</b> So as long as messages are consistently in the same wrong format (and they usually
are!), a message parser can look at that format, parse it, and make the message processable just
as if it had been well-formed in the first place. Plus, one can abuse the interface to do some other
"interesting" tricks, but that would take us too far.
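<p>As a rough illustration of that idea, the following Python sketch parses a consistently
malformed message into a well-defined structure. Everything specific in it is an assumption
made for the example: the device format ("&lt;PRI&gt;hostname;tag;text"), the function name
and the field names of the result are invented and do not reflect rsyslog's internal message
object. Real rsyslog parsers are written in C. The only borrowed fact is that the PRI value
encodes facility * 8 + severity.

```python
from typing import Optional


def parse_device_message(raw: str) -> Optional[dict]:
    """Parse the hypothetical format '<PRI>hostname;tag;text'.

    Returns a dict of fields, or None if the message is not in this format.
    """
    if not raw.startswith("<"):
        return None
    pri_text, sep, rest = raw[1:].partition(">")
    if not sep or not pri_text.isdigit():
        return None
    fields = rest.split(";", 2)           # hostname, tag, free text
    if len(fields) != 3:
        return None
    pri = int(pri_text)
    hostname, tag, text = fields
    return {
        "facility": pri // 8,             # PRI = facility * 8 + severity
        "severity": pri % 8,
        "hostname": hostname,
        "tag": tag,
        "msg": text,
    }


print(parse_device_message("<34>fw01;kernel;link down"))
print(parse_device_message("not from this device"))  # None
```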
<p>While this functionality may not sound exciting, it actually solves a very big issue (one that you
only really understand if you have managed a system with various different syslog sources).
Note that we were often able to process malformed messages in the past with the help of the
property replacer and regular expressions. While this is nice, it has a performance hit. A
message parser is C code, compiled to native machine code, and thus typically much faster than
any regular expression based method (depending, of course, on the quality of the implementation...).
<h2>How are message parsers used?</h2>
<p>In a simplified view, rsyslog
<ol>
<li>first receives messages (via the input module),
<li><i>then parses them (at the message level!)</i> and
<li>then processes them (operating on the internal message representation).
</ol>
Message parsers are utilized in the second step (written in italics).
Thus, they take the raw message (NOT frame!) received from the remote system and create
the internal structure out of it that the other parts of rsyslog need in order to perform
their processing. Parsing is vital, because an unparsed message can not be processed in the
third stage, the actual application-level processing (like forwarding or writing to files).
<h3>Parser Chains and how they Operate</h3>
Rsyslog chains parsers together to provide flexibility.
A <b>parser chain</b>
contains all parsers that can potentially be used to parse a message.
It is assumed that there is some
way a parser can detect if the message it is being presented is supported by it. If so, the parser
will tell the rsyslog engine and parse the message. The rsyslog engine calls each parser
inside the chain (in sequence!) until the first parser is able to parse the message. After one
such parser has been found, the message is considered parsed and no other parsers are called on that
message.
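<p>The calling sequence just described can be modelled in a few lines of Python. This is a
simplified sketch, not the engine's actual code: a "parser" is assumed here to be a function
that returns a dict on success and None when it can not handle the message, and the two toy
parsers are invented for the example.

```python
from typing import Callable, Optional

Parser = Callable[[str], Optional[dict]]


def run_chain(chain: list[tuple[str, Parser]], raw: str) -> Optional[dict]:
    """Call each parser in order; stop at the first one that succeeds."""
    for name, parser in chain:
        parsed = parser(raw)
        if parsed is not None:
            parsed["parser"] = name   # remember who parsed the message
            return parsed             # later parsers are never called
    return None                       # nobody could parse the message


def parse_dev1(raw: str) -> Optional[dict]:
    """Toy parser: only accepts messages starting with 'DEV1:'."""
    return {"msg": raw[5:]} if raw.startswith("DEV1:") else None


def parse_anything(raw: str) -> Optional[dict]:
    """Toy catch-all parser: accepts every message."""
    return {"msg": raw}


chain = [("dev1", parse_dev1), ("catch-all", parse_anything)]
print(run_chain(chain, "DEV1:fan failure")["parser"])  # dev1
print(run_chain(chain, "hello world")["parser"])       # catch-all
```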
<p>Side-note: this method implies there are some "not-so-dirty" tricks available: a parser
module can modify the message while declaring itself "unable to parse", so that the
remaining parsers in the chain still run. This was not a primary design goal, but it may be utilized, and the
interface probably extended, to support generic filter modules. These would need to go
to the root of the parser chain. As mentioned, the current system already supports this.
<p>The position inside the parser chain can be thought of as a priority: parsers sitting
earlier in the chain take precedence over those sitting later in it. So more specific
parsers should go earlier in the chain. A good example of how this works is the default parser
set provided by rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each of which parses according to the
RFC it is named after. RFC5424 was designed to be distinguishable from RFC3164 messages by the
sequence "1 " immediately after the so-called PRI-part (don't worry about these words, it is
sufficient if you understand there is a well-defined sequence used to identify RFC5424
messages). In contrast, RFC3164 actually permits everything as a valid message. Thus the
RFC3164 parser will always parse a message, sometimes with quite unexpected outcome (there is
a lot of guesswork involved in that parser, which unfortunately is unavoidable due to
existing technology limits). So the default parser chain is to try the RFC5424 parser first
and after it the RFC3164 parser. If we have a 5424-formatted message, that parser will
identify and parse it and the rsyslog engine will stop processing. But if we receive a
legacy syslog message, the RFC5424 parser will detect that it can not parse it and return this status
to the engine, which then calls the next parser inside the chain. That usually happens to be
the RFC3164 parser, which will always process the message. But there could also be any other
parser inside the chain, and then each one would be called until one that is able to parse
the message is found.
<p>If we reversed the parser order, RFC5424 messages would be parsed incorrectly. Why? Because the
RFC3164 parser will always parse every message, so if it were asked first, it would parse
(and misinterpret) the 5424-formatted message, report that it did so, and the rsyslog engine would
never call the 5424 parser. So the order of the parsers is very important.
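<p>The effect of the ordering can be demonstrated with a small Python sketch. The detection
rule used here is only the one named above ("1 " right after the PRI part). The real parsers
do considerably more, and all function names and sample messages are made up for the
illustration.

```python
def looks_like_rfc5424(raw: str) -> bool:
    """True if the text right after the <PRI> part is the marker '1 '."""
    if not raw.startswith("<"):
        return False
    end = raw.find(">")
    return end != -1 and raw.startswith("1 ", end + 1)


def parse_rfc5424(raw: str):
    """Stand-in for the RFC5424 parser: declines anything without the marker."""
    return "rfc5424" if looks_like_rfc5424(raw) else None


def parse_rfc3164(raw: str):
    """Stand-in for the RFC3164 parser: accepts every message."""
    return "rfc3164"


def first_match(chain, raw: str):
    """Return the result of the first parser in the chain that accepts raw."""
    for parser in chain:
        result = parser(raw)
        if result is not None:
            return result
    return None


new_style = "<34>1 2009-11-06T12:00:00Z host app - - - hello"
old_style = "<34>Nov  6 12:00:00 host app: hello"

default_order = [parse_rfc5424, parse_rfc3164]
reversed_order = [parse_rfc3164, parse_rfc5424]

print(first_match(default_order, new_style))   # rfc5424
print(first_match(default_order, old_style))   # rfc3164
print(first_match(reversed_order, new_style))  # rfc3164 (misinterpreted)
```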
<p>What happens if no parser in the chain could parse a message? Well, then we could not
obtain the in-memory representation that is needed to further process the message. In that
case, rsyslog has no other choice than to discard the message. If it does so, it will emit
a warning message, but only for the first 1,000 incidents. This limit is a safety measure
against message loops, which otherwise could quickly result from a parser chain
misconfiguration. <b>If you can not tolerate loss of unparsable messages, you must ensure
that each message can be parsed.</b> You can easily achieve this by always using the
"rsyslog.rfc3164" parser as the <i>last</i> parser inside parser chains. That may result
in invalid parsing, but you will have a chance to see the invalid message (in debug mode,
a warning message will be written to the debug log each time a message is dropped due to
inability to parse it).
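<p>The capped warning behaviour can be pictured like this. The Python sketch is only a
model of the rule stated above (warn for the first 1,000 discards, then stay silent). The
class name, the attribute names and the warning text are assumptions for the example and
are not taken from rsyslog.

```python
MAX_WARNINGS = 1000  # limit named in the text above


class DiscardLog:
    """Counts discarded messages and keeps at most `limit` warnings."""

    def __init__(self, limit: int = MAX_WARNINGS):
        self.limit = limit
        self.discarded = 0
        self.warnings = []

    def discard(self, raw: str) -> None:
        self.discarded += 1
        if self.discarded <= self.limit:   # silent once the limit is reached
            self.warnings.append(f"unable to parse message, discarded: {raw!r}")


log = DiscardLog(limit=2)                  # tiny limit to keep the demo short
for i in range(5):
    log.discard(f"garbage {i}")
print(log.discarded, len(log.warnings))    # 5 2
```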
<h3>Where are parser chains used?</h3>
<p>We now know what parser chains are and how they operate. The question is now how many
parser chains can be active and how it is decided which parser chain is used on which message.
This is controlled via <a href="multi_ruleset.html">rsyslog's rulesets</a>. In short, multiple
rulesets can be defined and there always exists at least one ruleset (for specifics, follow
the <a href="multi_ruleset.html">link</a>). A parser chain is bound to a specific ruleset.
This is done by virtue of defining parsers via the
<a href="rsconf1_rulesetparser.html">$RulesetParser</a> configuration directive (for specifics,
see there). If no such directive is specified, the default parser chain is used. As of this
writing, the default parser chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in
that order. As soon as a parser is configured, the default list is cleared and the new parser
is added to the end of the (initially empty) ruleset's parser chain.
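<p>The rule "default chain unless something is configured" can be modelled as a small data
structure. The Python sketch below is only a mental model: the class and method names are
invented, and "example.device1" is a hypothetical parser name. The two default parser names
are the ones given above.

```python
DEFAULT_CHAIN = ("rsyslog.rfc5424", "rsyslog.rfc3164")


class Ruleset:
    """A ruleset owning exactly one parser chain."""

    def __init__(self, name: str):
        self.name = name
        self._configured = []              # parsers added explicitly, in order

    def add_parser(self, parser_name: str) -> None:
        """Model of configuring one parser: append to this ruleset's chain."""
        self._configured.append(parser_name)

    @property
    def parser_chain(self) -> list[str]:
        # The default applies only while nothing has been configured.
        return list(self._configured) if self._configured else list(DEFAULT_CHAIN)


untouched = Ruleset("default")
print(untouched.parser_chain)   # ['rsyslog.rfc5424', 'rsyslog.rfc3164']

custom = Ruleset("device1")
custom.add_parser("example.device1")    # hypothetical custom parser
custom.add_parser("rsyslog.rfc3164")    # catch-all goes last
print(custom.parser_chain)      # ['example.device1', 'rsyslog.rfc3164']
```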
<p>The important point to know is that parser chains are defined on a per-ruleset basis.
<h3>Can I use different parser chains for different devices?</h3>
<p>The correct answer is: generally yes, but it depends. First of all, remember that input
modules (and specific listeners) may be bound to specific rulesets. As parser chains "reside"
in rulesets, binding to a ruleset also binds to the parser chain that is bound to that ruleset.
As a number one prerequisite, the input module must support binding to different rulesets. Not
all do, but their number is growing. For example, the important
<a href="imudp.html">imudp</a> and <a href="imtcp.html">imtcp</a> input modules support
that functionality. Those that do not (for example <a href="im3195.html">im3195</a>) can only
utilize the default ruleset and thus the parser chain defined in that ruleset.
<p>If you do not know if the input module in question supports ruleset binding, check
its documentation page. Those that support it have the required directives.
<p>Note that it is currently under evaluation if rsyslog will support binding parser chains
to specific inputs directly, without depending on the ruleset. There are some concerns that
this may not be necessary but would add considerable complexity to the configuration. So this may
or may not be possible in the future. In any case, if we decide to add it, input modules
need to support it, so this functionality would require some time to implement.
<p>The cookbook recipe for using different parsers for different devices is given
as an actual in-depth example in the <a href="rsconf1_rulesetparser.html">$RulesetParser</a>
configuration directive doc page. In short, it is accomplished by defining specific rulesets
for the required parser chains, defining different listener ports for each of the devices
with a different format, and binding these listeners to the correct ruleset (and thus parser
chain). Using that approach, a variety of different message formats can be supported
via a single rsyslog instance.
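<p>Reduced to its data relationships, the recipe looks like the following Python sketch. It
shows only which lookup leads to which parser chain. It is not configuration syntax, and the
port numbers, ruleset names and the parser name "example.device1" are invented for the
illustration.

```python
# Each ruleset owns one parser chain (names partly hypothetical).
rulesets = {
    "default": ["rsyslog.rfc5424", "rsyslog.rfc3164"],
    "device1": ["example.device1", "rsyslog.rfc3164"],
}

# Each listener (identified here by its port) is bound to one ruleset.
listeners = {
    514: "default",
    10514: "device1",
}


def chain_for_port(port: int) -> list[str]:
    """Listener -> ruleset -> parser chain."""
    return rulesets[listeners[port]]


print(chain_for_port(514))    # ['rsyslog.rfc5424', 'rsyslog.rfc3164']
print(chain_for_port(10514))  # ['example.device1', 'rsyslog.rfc3164']
```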
<h2>Which message parsers are available</h2>
<p>As of this writing, there exist only two message parsers, one for RFC5424 format and one for
legacy syslog (loosely described in
<a href="http://tools.ietf.org/html/rfc3164">RFC3164</a>). These parsers are built-in and
need not be explicitly loaded. However, message parsers can be added with relative ease
by anyone who knows how to code in C. They can then be loaded via $ModLoad just like any
other loadable module. It is expected that additional message parsers will be contributed to the
rsyslog project over time, so that at some point there hopefully is a rich choice of them
(I intend to add a browsable repository as soon as new parsers pop up).
<h3>How to write a message parser?</h3>
<p>As a prerequisite, you need to know the exact format that the device is sending. Then, you need
moderate C coding skills and a little knowledge of rsyslog internals. I guess the rsyslog-specific part
should not be that hard, as almost all information can be gained from the existing parsers. They
are rather simple in structure and can be found under the "./tools" directory. They are named
pmrfc3164.c and pmrfc5424.c. You need to follow the usual loadable module guidelines.
It is my expectation that writing a parser should typically not take longer than a single
day, with maybe a day more to get acquainted with rsyslog. Of course, I am not sure if that estimate
is actually right.
<p>If you can not program or have no time to do it, Adiscon can also write a message parser
for you as
part of the <a href="http://www.rsyslog.com/professional-services">rsyslog professional services
offering</a>.
<h2>Conclusion</h2>
<p>Malformed syslog messages are a pain and unfortunately often seen in practice. Message parsers
provide a fast and efficient solution for this problem. Different parsers can be defined for
different devices, and they all convert message information into rsyslog's well-defined
internal format. Message parsers were first introduced in rsyslog 5.3.4 and also offer
some interesting ideas that may be explored in the future - up to full message normalization
capabilities. It is strongly recommended that anyone with a heterogeneous environment take
a look at message parser capabilities.
<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual
index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
<p><font size="2">This documentation is part of the
<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
Copyright © 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL version 3 or higher.</font></p>
</body>
</html>