author John Thacker <johnthacker@gmail.com> 2023-03-19 14:12:42 -0400
committer John Thacker <johnthacker@gmail.com> 2023-03-19 21:14:12 +0000
commit 5426af1492ccb2c2c8877662dca677122460a868 (patch)
tree 5a608f53a98099fe1e7ecbecf35c17cbb6613219 /doc
parent 155dc0f92db797d51825eb01431a5443583d8913 (diff)
doc: Don't refer to tvb_get_faked_unicode
tvb_get_faked_unicode hasn't existed for nine years. Recommend tvb_get_string_enc() instead.
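
One point in the README text added by this patch is that truncating a valid UTF-8 string at an arbitrary byte count can split a multibyte character. The following is a minimal standalone C sketch of backing up to a character boundary before cutting. The helper name utf8_safe_truncate_len is hypothetical and is not a Wireshark API, and the sketch assumes the input has already been validated as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libwireshark): return the largest
 * length <= max_bytes at which the valid UTF-8 string s, which is len
 * bytes long, can be cut without splitting a multibyte character.
 */
static size_t
utf8_safe_truncate_len(const char *s, size_t len, size_t max_bytes)
{
    size_t n;

    assert(s != NULL);

    /* Nothing to cut: the whole string already fits. */
    if (max_bytes >= len)
        return len;

    /*
     * s[n] is the first byte that would be dropped.  Bytes of the form
     * 10xxxxxx are continuation bytes, so if s[n] is one, the cut falls
     * inside a character; step back until s[n] starts a character.
     */
    n = max_bytes;
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;

    return n;
}
```

For example, cutting the 6-byte string "caf\xC3\xA9!" to at most 4 bytes yields a length of 3: the cut would otherwise land between 0xC3 and 0xA9, so the whole two-byte character is dropped and only "caf" is kept.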
Diffstat (limited to 'doc')
-rw-r--r-- doc/README.developer 72
1 files changed, 46 insertions, 26 deletions
diff --git a/doc/README.developer b/doc/README.developer
index 69b0e8c32c..f1582adaf1 100644
--- a/doc/README.developer
+++ b/doc/README.developer
@@ -335,7 +335,7 @@ Do not use "open()", "rename()", "mkdir()", "stat()", "unlink()", "remove()",
"fopen()", "freopen()" directly. Instead use "ws_open()", "ws_rename()",
"ws_mkdir()", "ws_stat()", "ws_unlink()", "ws_remove()", "ws_fopen()",
"ws_freopen()": these wrapper functions change the path and file name from
-UTF8 to UTF16 on Windows allowing the functions to work correctly when the
+UTF-8 to UTF-16 on Windows allowing the functions to work correctly when the
path or file name contain non-ASCII characters.
Also, use ws_read(), ws_write(), ws_lseek(), ws_dup(), ws_fstat(), and
@@ -617,22 +617,6 @@ and
2) it won't crash trying to allocate an absurdly-large chunk of
memory if the size field has a bogus large value.
-If you're fetching into such a chunk of memory a string from the buffer,
-and the string has a specified size, you can use "tvb_get_*_string()",
-which will check whether the entire string is present before allocating
-a buffer for the string, and will also put a trailing '\0' at the end of
-the buffer.
-
-If you're fetching into such a chunk of memory a 2-byte Unicode string
-from the buffer, and the string has a specified size, you can use
-"tvb_get_faked_unicode()", which will check whether the entire string
-is present before allocating a buffer for the string, and will also
-put a trailing '\0' at the end of the buffer. The resulting string will be
-a sequence of single-byte characters; the only Unicode characters that
-will be handled correctly are those in the ASCII range. (Wireshark's
-ability to handle non-ASCII strings is limited; it needs to be
-improved.)
-
If you're fetching into such a chunk of memory a sequence of bytes from
the buffer, and the sequence has a specified size, you can use
"tvb_memdup()", which will check whether the entire sequence is present
@@ -644,16 +628,47 @@ TVB-accessor routines can handle requests to read data beyond the end of
the TVB (by throwing an exception which will either mark the frame as
truncated--not all the data was captured--or as malformed).
+If you're fetching a string only to add it to the tree, you should
+generally be using "proto_tree_add_item()" instead. If you also need
+the string, you can use the variant "proto_tree_add_item_ret_string()"
+or "proto_tree_add_item_ret_string_and_length()" forms.
+
+If you must fetch it from the tvbuff, and the string has a specified
+size and known encoding, you can use "tvb_get_string_enc()" for most
+encodings, which will check whether the entire string is present before
+allocating a buffer for the string, will put a trailing '\0' at the end
+of the buffer, and will also check for invalid characters in the supplied
+encoding and convert the string to UTF-8. The "tvb_get_*_string()" set of
+functions is available as well, and must be used for some encodings,
+primarily non byte aligned ones. If the string has a known encoding and
+is null terminated, the "stringz" variants can be used. (Note that these
+functions are called with memory allocators, and if called with a NULL
+allocator you are required to free the string when finished with it.)
+
+If the string has a known encoding but requires token parsing or other
+text manipulation to determine the offset and size, do so by calling
+tvb_*() functions on the tvbuff that perform bounds checking if possible.
+Only extract the bytes into a newly allocated buffer to extract a string
+if absolutely necessary. If you do so, then you *must* ensure that the
+string is valid UTF-8 when passing it to a libwireshark API function
+such as proto_tree_add_string(). (Cf. 7.5: Unicode and string encoding
+best practices.)
+
+Conversion to UTF-8 can produce a string with a length longer than
+that of the string in the original packet data; this includes strings
+encoded in ASCII or UTF-8 itself if they have invalid character sequences
+that are replaced with the 3 byte UTF-8 REPLACEMENT CHARACTER. Truncating
+a valid UTF-8 string to an arbitrary number of bytes does not guarantee
+that the result is a valid UTF-8 string, because a multibyte character
+might span the boundary.
+
Note also that you should only fetch string data into a fixed-length
buffer if the code ensures that no more bytes than will fit into the
buffer are fetched ("the protocol ensures" isn't good enough, as
protocol specifications can't ensure only packets that conform to the
specification will be transmitted or that only packets for the protocol
in question will be interpreted as packets for that protocol by
-Wireshark). If there's no maximum length of string data to be fetched,
-routines such as "tvb_get_*_string()" are safer, as they allocate a buffer
-large enough to hold the string. (Note that some variants of this call
-require you to free the string once you're finished with it.)
+Wireshark).
If you have gotten a pointer using "tvb_get_ptr()" (which you should not
have: you should seriously consider a better alternative to this function),
@@ -919,11 +934,11 @@ new extension languages).
7.5 Unicode and string encoding best practices
-Wireshark strings are always encoded in UTF-8 internally, regardless of the platform
-where it is running. The C datatype used is "pointer to char" and this is assumed
-to point to a valid UTF-8 string. Sometimes older code uses char to point to opaque
-byte strings but this archaic usage should be avoided. A better data type
-for that is uint8_t.
+Wireshark strings are always encoded in UTF-8 internally, regardless of the
+platform where it is running. The C datatype used is "pointer to char" and this
+is assumed to point to a valid UTF-8 string. Sometimes older code uses char to
+point to opaque byte strings but this archaic usage should be avoided. A better
+data type for that is uint8_t.
Every untrusted string needs to be validated for correct and error-free UTF-8
encoding, or converted from the source encoding to UTF-8. This should be done
@@ -943,6 +958,11 @@ TVBuffs it *must* make sure it passes only valid UTF-8 to libwireshark APIs.
This should be done using tvb_get_string_enc() to extract a string from a TVbuff
or get_utf_8_string() to validate a string after it has been constructed.
+The Qt API uses UTF-16 for its QString class; when converting between a
+QString and a pointer to char, functions that convert to or from UTF-8
+encoded pointers to char (or QByteArrays) such as toUtf8() should be used,
+not toLocal8Bit() or toLatin1().
+
8. Miscellaneous notes
Each commit in your branch corresponds to a different VCSVERSION string