author: John Thacker <johnthacker@gmail.com> 2023-06-07 08:07:44 -0400
committer: John Thacker <johnthacker@gmail.com> 2023-06-08 11:25:54 +0000
commit: 1744ce4a0f3af5adef40a19afcede29dd9c57e0c (patch)
tree: 3a46dc7b00c2f4b4b312f9e676e5cc6c53dc1ca4 /doc
parent: bda350d8fdab87c4cb54358abb44a0efa6cc5968 (diff)
epan: Add ENC_BOM modifier for UTF-16, UCS-2, UCS-4
Add ENC_BOM to the list of bitflag modifiers, and use it with UTF-16,
UCS-2, and UCS-4 (UTF-32). If set, the first 2 (or 4) octets, if present,
are checked to see whether they are a Little-Endian or Big-Endian BYTE
ORDER MARK (U+FEFF, "ZERO WIDTH NO-BREAK SPACE"). If so, those octets are
skipped and the encoding is set to Little-Endian or Big-Endian according
to the endianness of the BOM. If the BOM is absent, the passed-in
endianness flag is used as usual.

Related to #17991
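The BOM check described in the commit message can be sketched in plain C.
This is a minimal illustration under stated assumptions, not the code this
commit adds to epan: the names sketch_endian_t and sketch_check_bom are
hypothetical, and the buffer is a plain octet array rather than a tvbuff.
It returns how many leading octets to skip and reports the byte order to
decode with, falling back to the caller's value when no BOM is present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch; not part of Wireshark's API. */
typedef enum { SKETCH_BIG_ENDIAN, SKETCH_LITTLE_ENDIAN } sketch_endian_t;

/*
 * Look for a BYTE ORDER MARK (U+FEFF) at the start of buf.
 *
 * unit is the code-unit width in octets: 2 for UTF-16/UCS-2, 4 for UCS-4.
 * *order is set to the BOM's byte order if a BOM is found, otherwise to
 * fallback. The return value is the number of octets to skip (0, 2 or 4).
 */
size_t
sketch_check_bom(const uint8_t *buf, size_t len, size_t unit,
                 sketch_endian_t fallback, sketch_endian_t *order)
{
    assert(unit == 2 || unit == 4);
    assert(order != NULL);
    assert(buf != NULL || len == 0);

    *order = fallback;

    /* Too short to hold a BOM: keep the caller's byte order. */
    if (len < unit)
        return 0;

    if (unit == 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *order = SKETCH_BIG_ENDIAN;
            return 2;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *order = SKETCH_LITTLE_ENDIAN;
            return 2;
        }
    } else {
        if (buf[0] == 0x00 && buf[1] == 0x00 &&
            buf[2] == 0xFE && buf[3] == 0xFF) {
            *order = SKETCH_BIG_ENDIAN;
            return 4;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE &&
            buf[2] == 0x00 && buf[3] == 0x00) {
            *order = SKETCH_LITTLE_ENDIAN;
            return 4;
        }
    }

    /* No BOM: nothing is skipped and the fallback order stands. */
    return 0;
}
```

In a dissector the same effect would presumably be requested by OR-ing
ENC_BOM into the encoding argument alongside the character set and the
default byte order, for example ENC_UTF_16|ENC_LITTLE_ENDIAN|ENC_BOM.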
Diffstat (limited to 'doc')
-rw-r--r-- doc/README.dissector 17
1 file changed, 10 insertions, 7 deletions
diff --git a/doc/README.dissector b/doc/README.dissector
index e0dc5a6be6..f61e76c913 100644
--- a/doc/README.dissector
+++ b/doc/README.dissector
@@ -1500,12 +1500,15 @@ order.
For string fields, the encoding specifies the character set used for the
string and the way individual code points in that character set are
encoded. For FT_UINT_STRING fields, the byte order of the count must be
-specified; for UCS-2 and UTF-16, the byte order of the encoding must be
-specified (for counted UCS-2 and UTF-16 strings, the byte order of the
-count and the 16-bit values in the string must be the same). In other
-cases the string encoding has no endianness or the endianness is implicitly
-specified and nothing should be used. The character encodings that are
-currently supported are:
+specified. For UTF-16, UCS-2, and UCS-4, the byte order of the encoding
+must be specified, and optionally ENC_BOM can also be indicated to detect
+an initial BYTE ORDER MARK (the specified value is used if the field does
+not begin with a BOM.) For counted UTF-16, UCS-2, and UCS-4 strings, the
+byte order of the count and the characters in the string must be the same,
+unless a BOM overrides the value for the characters. In other cases the
+string encoding has no endianness or the endianness is implicitly specified
+and nothing should be used. The character encodings that are currently
+supported are:
ENC_ASCII - ASCII (currently treated as UTF-8; in the future,
all bytes with the 8th bit set will be treated as
@@ -1514,7 +1517,7 @@ currently supported are:
ENC_UTF_16 - UTF-16-encoded Unicode, with surrogate pairs
ENC_UCS_2 - UCS-2-encoded subset of Unicode, with no surrogate pairs
and thus no code points above 0xFFFF
- ENC_UCS_4 - UCS-4-encoded Unicode
+ ENC_UCS_4 - UCS-4-encoded Unicode (aka UTF-32)
ENC_WINDOWS_1250 - Windows-1250 code page
ENC_WINDOWS_1251 - Windows-1251 code page
ENC_WINDOWS_1252 - Windows-1252 code page