From e28ef20c8bbf887257f6b410def5dd5d2820044f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Valverde?=
Date: Mon, 26 Sep 2022 23:28:32 +0100
Subject: [PATCH] README.Developer: Add notes about string encoding and best-practices

---
 doc/README.developer | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/doc/README.developer b/doc/README.developer
index c09a8cacc1..d9281c9f8e 100644
--- a/doc/README.developer
+++ b/doc/README.developer
@@ -917,6 +917,32 @@ is also an essential component of a plugin system (libwireshark has plugins for
 taps, dissectors and an experimental interface to augment dissection with new
 extension languages).
 
+7.5 Unicode and string encoding best practices
+
+Wireshark strings are always encoded in UTF-8 internally, regardless of the platform
+where it is running. The C datatype used is "pointer to char" and this is assumed
+to point to a valid UTF-8 string. Sometimes older code uses char to point to opaque
+byte strings but this archaic usage should be avoided. A better data type
+for that is uint8_t.
+
+Every untrusted string needs to be validated for correct and error-free UTF-8
+encoding, or converted from the source encoding to UTF-8. This should be done
+at the periphery of the code. This means converting input during dissection or
+when reading input generally. To reiterate: all the Wireshark APIs expect to
+receive valid UTF-8 strings. These include proto_tree_add_string(),
+proto_item_append_text() and col_append_fstr() just to name a few.
+
+If a dissector uses standard API functions to handle strings, such as
+proto_tree_add_item() with an FT_STRING header field type, the API will
+transparently handle the conversion from the source encoding to UTF-8 and
+nothing else needs to be done to ensure valid string input.
+
+If your dissector does text manipulation, token parsing and such and generally
+extracts text strings from the TVBuff or tries to do line oriented input from
+TVBuffs it *must* make sure it passes only valid UTF-8 to libwireshark APIs.
+This should be done using tvb_get_string_enc() to extract a string from a TVbuff
+or get_utf_8_string() to validate a string after it has been constructed.
+
 8. Miscellaneous notes
 
 Each commit in your branch corresponds to a different VCSVERSION string
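
As a rough illustration of the validate-at-the-periphery rule in the new section 7.5, here is a minimal standalone sketch in C. It does not use libwireshark: utf8_seq_len(), is_valid_utf8() and sanitize_utf8() are hypothetical helpers invented for this example, and replacing each invalid byte with U+FFFD is an assumed policy, not a description of how get_utf_8_string() behaves. In a real dissector the text above recommends tvb_get_string_enc() or get_utf_8_string() instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not a Wireshark API: returns the length (1-4) of the
 * well-formed UTF-8 sequence starting at p, or 0 if the bytes are invalid. */
size_t utf8_seq_len(const uint8_t *p, size_t remaining)
{
    if (remaining == 0)
        return 0;
    uint8_t b0 = p[0];
    if (b0 < 0x80)
        return 1;                         /* ASCII */
    size_t n;
    uint32_t cp, min;
    if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
    else
        return 0;                         /* stray continuation or bad lead byte */
    if (remaining < n)
        return 0;                         /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                     /* missing continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                         /* overlong, out of range, surrogate */
    return n;
}

/* True if the whole buffer is well-formed UTF-8. */
bool is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0)
            return false;
        i += n;
    }
    return true;
}

/* Copies src into dst as a NUL-terminated string, replacing every invalid
 * byte with U+FFFD (assumed policy). dst must hold at least 3 * len + 1 bytes. */
void sanitize_utf8(const uint8_t *src, size_t len, char *dst, size_t dst_cap)
{
    assert(dst_cap >= 3 * len + 1);
    size_t i = 0, o = 0;
    while (i < len) {
        size_t n = utf8_seq_len(src + i, len - i);
        if (n == 0) {
            memcpy(dst + o, "\xEF\xBF\xBD", 3);   /* U+FFFD */
            o += 3;
            i += 1;
        } else {
            memcpy(dst + o, src + i, n);
            o += n;
            i += n;
        }
    }
    dst[o] = '\0';
}
```

For instance, calling sanitize_utf8() on the three bytes 'a', 0xFF, 'z' yields "a", U+FFFD, "z". Only a buffer cleaned up this way, or one that is_valid_utf8() accepts, would be safe to pass to a string-consuming API such as proto_tree_add_string().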