rsyslog/doc/messageparser.html

<html>
<head>
<title>Message parsers in rsyslog</title>
</head>
<body>
<a href="manual.html">rsyslog documentation</a>

<h1>Message parsers in rsyslog</h1>
<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>
(2009-11-06)</i></small></p>
<h2>Intro</h2>
<p>Message parsers are a feature of rsyslog 5.3.4 and above. In this article, I describe what
message parsers are, what they can do and how they relate to the relevant standards. I will
also describe what you can not do with time. Finally, I give some advice on implementing your
own custom parser.

<h2>What are message parsers?</h2>
<p>Well, the quick answer is that message parsers are the component of rsyslog that
parses the syslog message after it is being received. Prior to rsyslog 5.3.4, message parsers
where built in into the rsyslog core itself and could not be modified (other than by modifying
the rsyslog code).
<p>In 5.3.4, we changed that: message parsers are now loadable modules (just
like input and output modules). That means that new message parsers can be added without
modifying the rsyslog core, even without contributing something back to the
project.
<p>But that doesn't answer what a message parser really is. What does ist mean to &quot;parse a
message&quot; and, maybe more importantly, what is a message? To answer these questions correctly,
we need to dig down into the relevant standards.
<a href="http://tools.ietf.org/html/rfc5424">RFC5424</a> specifies a layered architecture
for the syslog protocol:
<p align="center"><img src="rfc5424layers.png" alt="RFC5424 syslog protocol layers">
<p>For us important is the distinction between the syslog transport and the upper layers.
The transport layer specifies how a stream of messages is assembled at the sender side and
how this stream of messages is disassembled into the individual messages at the receiver
side. In networking terminology, this is called &quot;framing&quot;. The core idea is that
each message is put into a so-called "frame", which then is transmitted over the communications
link.
<p>The framing used is depending on the protocol. For example, in UDP the "frame"-equivalent is
a packet that is being sent (this also means that no two messages can travel within a single
UDP packet). In "plain tcp syslog", the industry standard, LF is used as a frame delimiter
(which also means that no multi-line message can properly be transmitted, a "design" flaw
in plain tcp syslog). In <a href="http://tools.ietf.org/html/rfc5425">RFC5425</a> there is
a header in front of each frame that contains the size of the message. With this framing,
any message content can properly be transferred.
<p>And now comes the important part: <b>message parsers do NOT operate at the transport
layer</b>, they operate, as their name implies, on messages. So we can not use message
parsers to change the underlying framing. For example, if a sender splits (for whatever
reason) a single message into two and encapsulates these into two frames, there is no way
a message parser could undo that.
<p>A typical example may be a multi-line message: let's assume some originator has generated
a message for the format "A\nB" (where \n means LF). If that message is being transmitted
via plain tcp syslog, the frame delimiter is LF. So the sender will delimite the frame with
LF, but otherwise send the message unmodified onto the wire (because that is how things are
-unfortunately- done in plain tcp syslog...). So wire will see "A\nB\n". When this
arrives at the receiver, the transport layer will undo the framing. When it sees the LF
after A, it thinks it finds a valid frame delimiter (in fact, this is the correct view!). So
the receive will extract one complete message A and one complete message B, not knowing
that they once were both part of a large multi-line message. These two messages are then
passed to the upper layers, where the message parsers receive them and extract information.
However, the message parsers never know (or even have a chance to see) that A and B
belonged together. Even further, in rsyslog there is no guarnatee that A will be parsed
before B - concurrent operations may cause the reverse order (and do so very validly).
<p>The important lesson is: <b>message parsers can not be used to fix a broken framing</b>.
You need a full protocol implementation to do that, what is the domain of input and
output modules.
<p>I have now told you what you can not do with message parsers. But what they are good for?
Thankfully, broken framing is not the primary problem of the syslog world. A wealth of different
formats is. Unfortunately, many real-world implementations violate the relevant standards
in one way or another. That makes it often very hard to extract meaningful information from
a message or to process messages from different sources by the same rules. In my article
<a href="syslog_parsing.html">syslog parsing in rsyslog</a> I have elaborated on all
the real-world evil that you can usually see. So I won't repeat that here. But in short, the
real problem is not the framing, but how to make malformed messages well-looking.
<p><b>This is what message parsers permit you to do: take a (well-known) malformed message, parse
it according to its semantics and generate perfectly valid internal message representations
from it.</b> So as long as messages are consistenly in the same wrong format (and they usually
are!), a message parser can look at that format, parse it, and make the message processable just
like it were wellformed in the first place. Plus, one can abuse the interface to do some other
"intersting" tricks, but that would take us to far.
<p>While this functionality may not sound exciting, it actually solves a very big issue (that you
only really understand if you have managed a system with various different syslog sources).
Note that we were often able to process malformed messages in the past with the help of the
property replacer and regular expressions. While this is nice, it has a performance hit. A
message parser is a C code, compiled to native language, and thus typically much faster than
any regular expression based method (depending, of course, on the quality of the implementation...).

<h2>How are message parsers used?</h2>
<p>In a simlified view, rsyslog
<ol>
<li>first receives messages (via the input module),
<li><i>then parses them (at the message level!)</i> and
<li>then processes them (operating on the internal message representation).
</ol>
Message parsers are utilized in the second step (written in italics).
Thus, they take the raw message (NOT frame!) received from the remote system and create
the internal structure out of it that the other parts of rsyslog need in order to perform
their processing. Parsing is vital, because an unparsed message can not be processed in the
third stage, the actual application-level processing (like forwarding or writing to files).
<h3>Parser Chains and how they Operate</h3>
Rsyslog chains parsers together to provide flexibility.
A <b>parser chain</b>
contains all parsers that can potentially be used to parse a message.
It is assumed that there is some
way a parser can detect if the message it is being presented is supported by it. If so, the parser
will tell the rsyslog engine and parse the message. The rsyslog engine now calls each parser
inside the chain (in sequence!) until the first parser is able to parse the message. After one
parser has been found, the message is considered parsed and no others parsers are called on that
message.
<p>Side-note: this method implies there are some "not-so-dirty" tricks available to modify
the message by a parser module that declares itself as "unable to parse" but still does
some message modification. This was not a primary design goal, but may be utilized, and the
interface probably extended, to support generic filter modules. These would need to go
to the root of the parser chain. As mentioned, the current system already supports this.
<p>The position inside the parser chain can be thought of as a priority: parser sitting
earlier in the chain take precedence over those sitting later in it. So more specific
parser should go ealier in the chain. A good example of how this works is the default parser
set provided by rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each one parses according to the
rfc that has named it. RFC5424 was designed to be distinguishable from RFC3164 message by the
sequence "1 " immediately after the so-called PRI-part (don't worry about these words, it is
sufficient if you understand there is a well-defined sequence used to indentify RFC5424
messages). In contrary, RFC3164 actually permits everything as a valid message. Thus the
RFC3164 parser will always parse a message, sometimes with quite unexpected outcome (there is
a lot of guesswork involved in that parser, which unfortunately is unavoidable due to
existing techology limits). So the default parser chain is to try the RFC5424 parser first
and after it the RFC3164 parser. If we have a 5424-formatted message, that parser will
identify and parse it and the rsyslog engine will stop processing. But if we receive a
legacy syslog message, the RFC5424 will detect that it can not parse it, return this status
to the engine which then calls the next parser inside the chain. That usually happens to be
the RFC3164 parser, which will always process the message. But there could also be any other
parser inside the chain, and then each one would be called unless one that is able to parse
can be found.
<p>If we reversed the parser order, RFC5424 messages would incorrectly parsed. Why? Because the
RFC3164 parser will always parse every message, so if it were asked first, it would parse
(and misinterpret) the 5424-formatted message, return it did so and the rsyslog engine would
never call the 5424 parser. So oder of sequence is very important.
<p>What happens if no parser in the chain could parse a message? Well, then we could not
obtain the in-memory representation that is needed to further process the message. In that
case, rsyslog has no other choice than to discard the message. If it does so, it will emit
a warning message, but only in the first 1,000 incidents. This limit is a safety measure
against message-loops, which otherwise could quickly result from a parser chain
misconfiguration. <b>If you do not tolerate loss of unparsable messages, you must ensure
that each message can be parsed.</b> You can easily achive this by always using the
"rsyslog-rfc3164" parser as the <i>last</i> parser inside parser chains. That may result
in invalid parsing, but you will have a chance to see the invalid message (in debug mode,
a warning message will be written to the debug log each time a message is dropped due to
inability to parse it).
<h3>Where are parser chains used?</h3>
<p>We now know what parser chains are and how they operate. The question is now how many
parser chains can be active and how it is decicded which parser chain is used on which message.
This is controlled via <a href="multi_ruleset.html">rsyslog's rulesets</a>. In short, multiple
rulesets can be defined and there always exist at least one ruleset (for specifcs, follow
the <a href="multi_ruleset.html">link</a>). A parser chain is bound to a specific ruleset.
This is done by virtue of defining parsers via the
<a href="rsconf1_rulesetparser.html">$RulesetParser</a> configuration directive (for specifics,
see there). If no such directive is specified, the default parser chain is used. As of this
writing, the default parser chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in
that order. As soon as a parser is configured, the default list is cleared and the new parser
is added to the end of the (initially empty) ruleset's parser chain.
<p>The important point to know is that parser chains are defined on a per-ruleset basis.
<h3>Can I use different parser chains for different devices?</h3>
<p>The correct answer is: generally yes, but it depends. First of all, remember that input
modules (and specific listeners) may be bound to specific rulesets. As parser chains "reside"
in rulesets, binding to a ruleset also binds to the parser chain that is bound to that ruleset.
As a number one prequisite, the input module must support binding to different rulesets. Not
all do, but their number is growing. For example, the important
<a href="imudp.html">imudp</a> and <a href="imtcp.html">imtcp</a> input modules support
that functionality. Those that do not (for example <a href="im3195">im3195</a>) can only
utilize the default ruleset and thus the parser chain defined in that ruleset.
<p>If you do not know if the input module in question supports ruleset binding, check
its documentation page. Those that support it have the requiered directives.
<p>Note that it is currently under evaluation if rsyslog will support binding parser chains
to specific inputs directly, without depending on the ruleset. There are some concerns that
this may not be necessary but adds considerable complexity to the configuration. So this may
or may not be possible in the future. In any case, if we decide to add it, input modules
need to support it, so this functionality would require some time to implement.
<p>The coockbook recipe for using different parsers for different devices is given
as an actual in-depth example in the <a href="rscon1_rulesetsparser.html">$RulesetParser</a>
configuration directive doc page. In short, it is acomplished by defining specific rulesets
for the required parser chains, definining different listener ports for each of the devices
with different format and binding these listeners to the correct ruleset (and thus parser
chains). Using that approach, a variety of different message formats can be supported
via a single rsyslog instance.

<h2>Which message parsers are available</h2>
<p>As of this writing, there exist only two message parsers, one for RFC5424 format and one for
legacy syslog (loosely described in
<a href="http://tools.ietf.org/html/rfc3164">RFC3164</a>). These parsers are built-in and
must not be explicitely loaded. However, message parsers can be added with relative ease
by anyone knowing to code in C. Then, they can be loaded via $ModLoad just like any
other loadable module. It is expected that the rsyslog project will be contributed additional
message parsers over time, so that at some point there hopefully is a rich choice of them
(I intend to add a browsable repository as soon as new parsers pop up).
<h3>How to write a message parser?</h3>
<p>As a prequisite, you need to know the exact format that the device is sending. Then, you need
moderate C coding skills, and a little bit of rsyslog internals. I guess the rsyslog specific part
should not be that hard, as almost all information can be gained from the existing parsers. They
are rather simple in structure and can be found under the "./tools" directory. They are named
pmrfc3164.c and pmrfc5424.c. You need to follow the usual loadable module guidelines.
It is my expectation that writing a parser should typically not take longer than a single
day, with maybe a day more to get aquainted with rsyslog. Of course, I am not sure if the number
is actually right.
<p>If you can not program or have no time to do it, Adiscon can also write a message parser
for you as
part of the <a href="http://www.rsyslog/professional-services">rsyslog professional services
offering</a>.
<h2>Conclusion</h2>
<p>Malformed syslog messages are a pain and unfortunately often seen in practice. Message parsers
provide a fast and efficient solution for this problem. Different parsers can be defined for
different devices, and they all convert message information into rsyslog's well-defined
internal format. Message parsers were first introduced in rsyslog 5.3.4 and also offer
some interesting ideas that may be explored in the future - up to full message normalization
capabilities. It is strongly recommended that anyone with a heterogenous environment take
a look at message parser capabilities.

<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual
index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
<p><font size="2">This documentation is part of the
<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
Copyright &copy; 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL version 3 or higher.</font></p>
</body>
</html>