1. Invalid utf8 detection didn't handle 3 and 4-byte overlong encodings (2
byte overlong econdings were handled explicitly by rejection E0 and E1
start bytes). Unified checks for overlong encodings.
2. Surrogates U+D800..U+DFFF are not valid codepoints (Unicode Standard, D92)
3. Replacement of characters in invalid 3 or 4-bytes encodings was too
eager. It must not replace bytes which are valid UTF-8 sequences. For
example, in [0xE0 0xC2 0xA7] sequence the 0xC2 is invalid as a continuation
byte, but it starts a valid UTF8 symbol [0xC2 0xA7]. That is, with current
code processing the sequence will result in "???" but the correct result is "?§"
(provided that the replacement character is "?").
4. Various tests for UTF-8 invalid/valid sequences.
Also now permit interactivly running tests without explicitly setting
$srcdir. This now works if we are inside ./tests and fails, as before,
when we are in a different directory.
Detected by shellcheck via CodeFactor.io