author | John Thacker <johnthacker@gmail.com> | 2023-06-07 08:07:44 -0400 |
---|---|---|
committer | John Thacker <johnthacker@gmail.com> | 2023-06-08 11:25:54 +0000 |
commit | 1744ce4a0f3af5adef40a19afcede29dd9c57e0c (patch) | |
tree | 3a46dc7b00c2f4b4b312f9e676e5cc6c53dc1ca4 /doc | |
parent | bda350d8fdab87c4cb54358abb44a0efa6cc5968 (diff) |
epan: Add ENC_BOM modifier for UTF-16, UCS-2, UCS-4
Add ENC_BOM to the list of bitflag modifiers, and use it with
UTF-16, UCS-2, and UCS-4 (UTF-32). If set, the first 2 (or 4) octets,
if present, are checked to see if they are a Little-Endian or
Big-Endian BYTE ORDER MARK (U+FEFF, "ZERO WIDTH NO-BREAK SPACE"). If so,
those octets are skipped and the encoding is set to Little-Endian
or Big-Endian depending on the endianness of the BOM.
If the BOM is absent, the passed-in endianness flag is used normally.
Related to #17991
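
The BOM check described in the message can be sketched in standalone C as follows. This is only an illustration of the rule, not the code added to epan: the names bom_result and detect_bom are hypothetical, and the unit parameter (2 for UTF-16/UCS-2, 4 for UCS-4) is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: which byte order to decode with, and how
 * many leading octets the BOM occupied (0 if no BOM was found). */
typedef struct {
    bool   little_endian;
    size_t skip;
} bom_result;

/*
 * Hypothetical helper, not Wireshark's implementation.
 * unit is 2 for UTF-16/UCS-2 and 4 for UCS-4 (UTF-32).
 * If the buffer is too short or does not start with U+FEFF in either
 * byte order, the caller's default endianness is returned unchanged.
 */
static bom_result detect_bom(const uint8_t *buf, size_t len, size_t unit,
                             bool default_little_endian)
{
    bom_result r = { default_little_endian, 0 };

    if (len < unit)
        return r;               /* "if present": too short to hold a BOM */

    if (unit == 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            r.little_endian = false;    /* FE FF: big-endian BOM */
            r.skip = 2;
        } else if (buf[0] == 0xFF && buf[1] == 0xFE) {
            r.little_endian = true;     /* FF FE: little-endian BOM */
            r.skip = 2;
        }
    } else if (unit == 4) {
        if (buf[0] == 0x00 && buf[1] == 0x00 &&
            buf[2] == 0xFE && buf[3] == 0xFF) {
            r.little_endian = false;    /* 00 00 FE FF: big-endian BOM */
            r.skip = 4;
        } else if (buf[0] == 0xFF && buf[1] == 0xFE &&
                   buf[2] == 0x00 && buf[3] == 0x00) {
            r.little_endian = true;     /* FF FE 00 00: little-endian BOM */
            r.skip = 4;
        }
    }
    return r;
}
```

A caller would decode the string starting at buf + r.skip using r.little_endian, which equals the passed-in default whenever no BOM is found.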
Diffstat (limited to 'doc')
-rw-r--r-- | doc/README.dissector | 17 |
1 files changed, 10 insertions, 7 deletions
diff --git a/doc/README.dissector b/doc/README.dissector
index e0dc5a6be6..f61e76c913 100644
--- a/doc/README.dissector
+++ b/doc/README.dissector
@@ -1500,12 +1500,15 @@ order.
 For string fields, the encoding specifies the character set used for the
 string and the way individual code points in that character set are
 encoded. For FT_UINT_STRING fields, the byte order of the count must be
-specified; for UCS-2 and UTF-16, the byte order of the encoding must be
-specified (for counted UCS-2 and UTF-16 strings, the byte order of the
-count and the 16-bit values in the string must be the same). In other
-cases the string encoding has no endianness or the endianness is implicitly
-specified and nothing should be used. The character encodings that are
-currently supported are:
+specified. For UTF-16, UCS-2, and UCS-4, the byte order of the encoding
+must be specified, and optionally ENC_BOM can also be indicated to detect
+an initial BYTE ORDER MARK (the specified value is used if the field does
+not begin with a BOM.) For counted UTF-16, UCS-2, and UCS-4 strings, the
+byte order of the count and the characters in the string must be the same,
+unless a BOM overrides the value for the characters. In other cases the
+string encoding has no endianness or the endianness is implicitly specified
+and nothing should be used. The character encodings that are currently
+supported are:
 
     ENC_ASCII - ASCII (currently treated as UTF-8; in the future,
         all bytes with the 8th bit set will be treated as
@@ -1514,7 +1517,7 @@ currently supported are:
     ENC_UTF_16 - UTF-16-encoded Unicode, with surrogate pairs
     ENC_UCS_2 - UCS-2-encoded subset of Unicode, with no surrogate pairs
         and thus no code points above 0xFFFF
-    ENC_UCS_4 - UCS-4-encoded Unicode
+    ENC_UCS_4 - UCS-4-encoded Unicode (aka UTF-32)
     ENC_WINDOWS_1250 - Windows-1250 code page
     ENC_WINDOWS_1251 - Windows-1251 code page
     ENC_WINDOWS_1252 - Windows-1252 code page
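
The rule for counted strings in the new README text (the count always uses the specified byte order, while a BOM may override the byte order of the characters) can be illustrated with a small standalone C sketch. It is not Wireshark code: read_u16 and decode_counted_utf16 are hypothetical names, and the example assumes a 16-bit count that gives the string length in octets, including any BOM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read one 16-bit value in the requested byte order. */
static uint16_t read_u16(const uint8_t *p, bool little_endian)
{
    return little_endian ? (uint16_t)(p[0] | (p[1] << 8))
                         : (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical decoder for a UTF-16 string preceded by a 16-bit count.
 * Assumption for this sketch: the count is in octets and covers the BOM.
 * Returns the number of 16-bit code units written to out, or -1 if the
 * buffer is too short for the count or the counted string.
 */
static int decode_counted_utf16(const uint8_t *buf, size_t len,
                                bool little_endian, bool check_bom,
                                uint16_t *out, size_t out_cap)
{
    if (len < 2)
        return -1;

    /* The count is always read with the caller-specified byte order. */
    size_t count = read_u16(buf, little_endian);
    if (count > len - 2)
        return -1;

    const uint8_t *s = buf + 2;
    bool str_le = little_endian;    /* default when there is no BOM */

    if (check_bom && count >= 2) {
        uint16_t first = read_u16(s, false);    /* octets as FE FF or FF FE */
        if (first == 0xFEFF) {          /* big-endian BOM */
            str_le = false;
            s += 2;
            count -= 2;
        } else if (first == 0xFFFE) {   /* little-endian BOM */
            str_le = true;
            s += 2;
            count -= 2;
        }
    }

    /* Decode whole code units only; a trailing odd octet is ignored. */
    size_t n = 0;
    for (size_t i = 0; i + 1 < count && n < out_cap; i += 2)
        out[n++] = read_u16(s + i, str_le);
    return (int)n;
}
```

With the input 06 00 FE FF 00 41 00 42 and a little-endian default, the count is read as 6 in little-endian order, while the BOM switches the characters to big-endian, giving the code units 0x0041 and 0x0042.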