Unless there is no available space, ensure that the label_str
passed into ws_label_strcpy is null-terminated in the cases
where the string to copy is empty or begins with invalid UTF-8.
Fix #18560. Fix #18551.
Format the input for display by escaping some non-printable characters
using ws_label_strcpy().
In some cases with vsnprintf() this requires using a temporary buffer.
Add some debug checks for invalid UTF-8 errors.
The intention here is to pass dissection data directly to the column
API, and the column functions are responsible for formatting that
data for display. This avoids having to call format_text() before
adding a string to a column and separates the concerns better.
Display formatting is a UI concern.
hex_str_to_bytes_encoding() consumes pairs of hex digits (and
optional separator) to turn into bytes. It can return a pointer
to the character after the last digit consumed. Don't advance
the end pointer after a single unpaired digit that is not consumed
as part of the hex string returned.
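The end-pointer rule described above can be sketched as follows. This is a minimal standalone illustration, not the code of hex_str_to_bytes_encoding(); the names hex_nibble() and decode_hex_pairs() are hypothetical, and the handling of a single optional separator character is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_nibble(char c)
{
    unsigned char uc = (unsigned char)c;

    if (uc >= '0' && uc <= '9') return uc - '0';
    if (uc >= 'a' && uc <= 'f') return uc - 'a' + 10;
    if (uc >= 'A' && uc <= 'F') return uc - 'A' + 10;
    return -1;
}

/*
 * Hypothetical sketch: decode pairs of hex digits from the NUL-terminated
 * string s into out (capacity out_cap), skipping one optional separator
 * character sep between pairs ('\0' means no separator).
 * Returns the number of bytes written. *endptr is set to the character
 * after the last digit that was actually consumed as part of a pair.
 */
static size_t decode_hex_pairs(const char *s, char sep,
                               unsigned char *out, size_t out_cap,
                               const char **endptr)
{
    const char *p = s;
    const char *end = s;
    size_t n = 0;

    while (n < out_cap) {
        int hi, lo;

        hi = hex_nibble(p[0]);
        if (hi < 0)
            break;
        lo = hex_nibble(p[1]);
        if (lo < 0)
            break;      /* unpaired digit: not consumed, end stays put */
        out[n++] = (unsigned char)((hi << 4) | lo);
        p += 2;
        end = p;        /* advance only past fully consumed pairs */
        if (sep != '\0' && *p == sep)
            p++;
    }
    if (endptr != NULL)
        *endptr = end;
    return n;
}
```

With the input "0a:ff:3" and ':' as separator, this sketch decodes two bytes and leaves the end pointer just after "ff", so neither the trailing separator nor the unpaired '3' counts as consumed.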
tvb_get_string_bytes() can pass back the end offset. If conversion
fails, return the initial offset instead of zero to make repeated
calls easier in cases where the full length is not decoded due to
errors.
Relatedly, no dissector uses this return value yet, because until
now it has not been useful.
The proto.h APIs expect valid UTF-8 so replace uses of format_text()
with a label copy function that just does formatting and does not
check for encoding errors. Avoid multiple levels of temporary
string allocations.
Make sure the copy does not truncate a multibyte character and
produce invalid strings. Add debug checks for UTF-8 encoding errors
instead.
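A minimal sketch of a copy that never cuts a multibyte character in half is shown below. It is not the label copy function itself; utf8_copy_trunc() is a hypothetical name, and the sketch assumes the source string is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: copy src into dst (dst_size bytes including the
 * terminating NUL). If src does not fit, truncate at a character boundary
 * so that the result is still valid UTF-8.
 * Assumes src is valid UTF-8. Returns the number of bytes copied,
 * not counting the NUL.
 */
static size_t utf8_copy_trunc(char *dst, size_t dst_size, const char *src)
{
    size_t len, n;

    if (dst_size == 0)
        return 0;

    len = strlen(src);
    n = (len < dst_size - 1) ? len : dst_size - 1;

    if (n < len) {
        /*
         * src[n] is the first byte that will NOT be copied. If it is a
         * continuation byte (10xxxxxx), the cut falls inside a character,
         * so back up until the cut is in front of a lead byte.
         */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Copying "a" followed by U+00E9 (bytes C3 A9) into a 3-byte buffer yields just "a" rather than "a" plus a dangling C3 lead byte.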
We escape C0 and C1 control codes (because they are not printable)
and ASCII whitespace (and bell).
Overall the goal is to be more efficient and to help detect API
misuse in the form of passing invalid UTF-8.
Add a unit test for ws_label_strcat.
Using a warning is probably too severe for the current state
of the code, where UTF-8 errors are somewhat expected from
dissectors that are lax about input validation.
Use a debug level with its own "UTF-8" domain instead.
Using a dedicated domain allows filtering on encoding errors and,
with some enhancements to the logging subsystem, making them fatal
for tracking and debugging purposes.
Using a dedicated domain might have other drawbacks but for now
it seems like the best approach.
XML 1.0 allows valid UTF-8 characters, except for the ASCII control
characters other than tab, carriage return, and line feed.
(It does not allow form feed and vertical tab, so the allowed group is
not the same as the standard ctype.h isspace category. It also
allows but discourages DEL (\x7F).)
The characters cannot be included as character references of the
form &#xx; either; there is technically no way to include them.
Escape them as done prior to 89e96c1e77
but continue to leave bytes with the high bit set alone so that
UTF-8 printable characters are not escaped.
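The rule above boils down to a small per-byte predicate, sketched here. The name xml_byte_needs_escape() is hypothetical, and how the escaped form is written is deliberately left out since the message does not specify it.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch: does this byte need escaping before being written
 * into XML 1.0 text? ASCII control characters other than tab, line feed
 * and carriage return are not allowed. DEL (0x7F) is allowed, and bytes
 * with the high bit set are left alone so that UTF-8 sequences pass
 * through unchanged.
 */
static bool xml_byte_needs_escape(unsigned char c)
{
    if (c >= 0x20)
        return false;   /* printable ASCII, DEL, and UTF-8 bytes */
    return !(c == '\t' || c == '\n' || c == '\r');
}
```

Note that form feed and vertical tab need escaping here even though isspace() accepts them, which is why isspace() alone cannot be used for this check.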
Fix #10445
Add Percent-encoding to the list of encoding types that Show
Packet Bytes can handle.
There's a function added to glib 2.66 to handle this for arbitrary
bytes that might have internal nulls (and which allows the result
to be non UTF-8), but we don't require that version yet, so extend
the existing function.
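For illustration, percent-decoding of arbitrary bytes can be sketched as below. This is a standalone example, not the extended function mentioned above; percent_decode() and hex_val() are hypothetical names, and copying malformed escapes through literally is an assumption of this sketch. Working on a length rather than a NUL terminator is what lets a decoded %00 survive.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical sketch: percent-decode len bytes from src into dst.
 * dst must have room for at least len bytes (the output never grows).
 * Returns the number of bytes written. The output may contain NUL
 * bytes and is not guaranteed to be valid UTF-8.
 * Assumption: a '%' not followed by two hex digits is copied as-is.
 */
static size_t percent_decode(const unsigned char *src, size_t len,
                             unsigned char *dst)
{
    size_t i = 0;
    size_t n = 0;

    while (i < len) {
        if (src[i] == '%' && i + 2 < len) {
            int hi = hex_val(src[i + 1]);
            int lo = hex_val(src[i + 2]);

            if (hi >= 0 && lo >= 0) {
                dst[n++] = (unsigned char)((hi << 4) | lo);
                i += 3;
                continue;
            }
        }
        dst[n++] = src[i++];
    }
    return n;
}
```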
Related to #1084
Repeated words were found with:
egrep "(\b[a-zA-Z]+) +\1\b" . -Ir
and then manually reviewed.
Non-displayed strings (e.g., in comments)
were also corrected, to ease future review.
Remove ws_strdup_escape_char(). I don't think it is generic enough to keep,
and it does not seem very efficient either.
Remove string_replace(). This function was used in the GTK GUI.
Move epan_memmem() and epan_strcasestr() to wsutil/str_util.
Rename to ws_memmem() and ws_strcasestr(). Add a compile-time
check for a system implementation and use that if available.
We invoke those functions using a wrapper to avoid exposing
_GNU_SOURCE outside of the implementation.
Registering a preference module for a protocol filter name with
upper case letters aborts the program. Relax this restriction to
conform with the rules for protocols. The recommendation is still
to use all lower-case letters.
Fixes 070aeddf76.
We have two format_size()s, with and without wmem scoped memory.
Move the wmem version to wsutil and add a convenience macro to
use g_malloc()ed memory.
format_text uses the wrong bitmask when checking for two-byte UTF-8
characters, resulting in rejecting half the possible two-byte characters,
including all of Arabic and Greek, and substituting REPLACEMENT CHARACTER
for them. Fixes #17070, and add some comments about the current behavior
that doesn't match existing comments.
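As a sketch of the kind of check involved: a two-byte UTF-8 lead byte has the bit pattern 110xxxxx, so the test masks the top three bits. The second function uses 0xE8, a mask assumed here purely for illustration (the message does not quote the old mask); because it also tests bit 3, it rejects half of the two-byte lead bytes, among them those used for Greek (0xCD to 0xCF) and Arabic (0xD8 to 0xDB). Both function names are hypothetical.

```c
#include <stdbool.h>

/*
 * Sketch: true if c has the bit pattern of a two-byte UTF-8 lead byte
 * (110xxxxx). Only the pattern is checked; overlong leads 0xC0 and 0xC1
 * are not filtered out here.
 */
static bool is_two_byte_lead(unsigned char c)
{
    return (c & 0xE0) == 0xC0;
}

/*
 * Illustration only: a mask that also tests bit 3 (0xE8 is an assumed
 * value) rejects every lead byte in 0xC8..0xCF and 0xD8..0xDF.
 */
static bool is_two_byte_lead_wrong_mask(unsigned char c)
{
    return (c & 0xE8) == 0xC0;
}
```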
format_text(alloc, string, strlen(string)) is a common idiom; provide
format_text_string(), which does the strlen(string) for you. (Any
string used in a %s to set the text of a protocol tree item, if it was
directly extracted from the packet, should be run through a format_text
routine, to ensure that it's valid UTF-8 and that control characters are
handled correctly.)
Update comments while we're at it.
Change-Id: Ia8549efa1c96510ffce97178ed4ff7be4b02eb6e
Reviewed-on: https://code.wireshark.org/review/38202
Petri-Dish: Guy Harris <gharris@sonic.net>
Tested-by: Petri Dish Buildbot
Reviewed-by: Guy Harris <gharris@sonic.net>
It's a "wmem version" of format_size (from wsutil/str_util.h).
Also improved the formatting flexibility of format_size() to handle future
needs of format_size_wmem.
Ping-Bug: 15360
Change-Id: Id9977bbd7ec29375bbac955f685d46e75b0cef2c
Reviewed-on: https://code.wireshark.org/review/31233
Petri-Dish: Michael Mann <mmann78@netscape.net>
Tested-by: Petri Dish Buildbot
Reviewed-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: Anders Broman <a.broman58@gmail.com>
Change all wireshark.org URLs to use https.
Fix some broken links while we're at it.
Change-Id: I161bf8eeca43b8027605acea666032da86f5ea1c
Reviewed-on: https://code.wireshark.org/review/34089
Reviewed-by: Guy Harris <guy@alum.mit.edu>
Note that even strings fetched with ENC_ASCII may contain them - bytes
with the 8th bit set get mapped to REPLACEMENT CHARACTER.
This means we can format STR_UNICODE fields with format_text(); do so.
Bug: 1372
Change-Id: Ia32c3a92d220ac5174ecd25f33e2d1f85cfb8cb8
Reviewed-on: https://code.wireshark.org/review/34080
Reviewed-by: Guy Harris <guy@alum.mit.edu>
It was using the same index into the input and output strings, which
means that if it escaped any character, it would skip the next two
characters in the input string.
It was also not clearing is_reserved before testing whether a character
was reserved, so once it saw a character that needed to be escaped, it
would escape all subsequent characters.
It was only used in get_key_string(), which was never used, so it was
dead code, but let's at least fix it, even if we end up removing that
code, so that if we bring it back, we bring back a non-broken version,
and so that if anybody *else* uses it, it's not broken.
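The two fixes described above (separate input and output indices, and resetting is_reserved for every character) can be sketched like this. It is a simplified standalone illustration, not the fixed function; escape_reserved() is a hypothetical name, and the %XX output form and caller-supplied buffer are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: copy in to out, replacing every character found
 * in the reserved set with %XX. out must have room for
 * 3 * strlen(in) + 1 bytes.
 */
static void escape_reserved(const char *in, const char *reserved, char *out)
{
    static const char hex_digits[] = "0123456789ABCDEF";
    size_t in_idx;
    size_t out_idx = 0;     /* output index advances independently */

    for (in_idx = 0; in[in_idx] != '\0'; in_idx++) {
        unsigned char c = (unsigned char)in[in_idx];
        bool is_reserved = false;   /* reset for every character */

        if (strchr(reserved, c) != NULL)
            is_reserved = true;

        if (is_reserved) {
            /* Three output bytes for one input byte. */
            out[out_idx++] = '%';
            out[out_idx++] = hex_digits[c >> 4];
            out[out_idx++] = hex_digits[c & 0x0F];
        } else {
            out[out_idx++] = (char)c;
        }
    }
    out[out_idx] = '\0';
}
```

With a shared index, the two extra output bytes would push the input position forward as well; with a stale is_reserved flag, everything after the first reserved character would be escaped.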
Change-Id: I36588efad36908e012023bcfbd813c749a6a254f
Reviewed-on: https://code.wireshark.org/review/33287
Petri-Dish: Guy Harris <guy@alum.mit.edu>
Tested-by: Petri Dish Buildbot
Reviewed-by: Guy Harris <guy@alum.mit.edu>
Found by clang-tidy.
Change-Id: Ibedfec5e5d3eca7c2e65319b7ecb4dcbe974b88b
Reviewed-on: https://code.wireshark.org/review/31337
Petri-Dish: Dario Lombardo <lomato@gmail.com>
Petri-Dish: Guy Harris <guy@alum.mit.edu>
Tested-by: Petri Dish Buildbot
Reviewed-by: Anders Broman <a.broman58@gmail.com>
Change-Id: Ic6de84a37b501e9c62a7d37071b2b081a1a1dd50
Reviewed-on: https://code.wireshark.org/review/19885
Petri-Dish: Michael Mann <mmann78@netscape.net>
Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
Reviewed-by: Michael Mann <mmann78@netscape.net>
All cases of the "original" format_text have been handled to add the
proper wmem allocator scope. Remove the "original" format_text
and replace it with one that has a wmem allocator as a parameter.
Change-Id: I278b93bcb4a17ff396413b75cd332f5fc2666719
Reviewed-on: https://code.wireshark.org/review/19884
Petri-Dish: Michael Mann <mmann78@netscape.net>
Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
Reviewed-by: Michael Mann <mmann78@netscape.net>
This allows for a wmem_allocator for users of format_text who want
it (dissectors for wmem_packet_scope()). This lessens the role of
current format_text functionality in hopes that it will eventually
be replaced.
Change-Id: I970557a65e32aa79634a3fcc654ab641b871178e
Reviewed-on: https://code.wireshark.org/review/19855
Reviewed-by: Michael Mann <mmann78@netscape.net>
tvb_format_text_wsp and tvb_format_stringzpad_wsp both call format_text_wsp,
so those functions need to add a wmem allocator parameter as well.
Most of the changes came from updating callers of tvb_format_text_wsp and
tvb_format_stringzpad_wsp rather than callers of format_text_wsp.
Change-Id: I52214ca107016f0e96371a9a8430aa89336f91d7
Reviewed-on: https://code.wireshark.org/review/19851
Petri-Dish: Michael Mann <mmann78@netscape.net>
Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
Reviewed-by: Michael Mann <mmann78@netscape.net>
Change-Id: Idcea59f6fc84238f04d9ffc11a0088ef97beec0c
Reviewed-on: https://code.wireshark.org/review/19844
Petri-Dish: Michael Mann <mmann78@netscape.net>
Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
Reviewed-by: Michael Mann <mmann78@netscape.net>
Casting a signed char with a negative value to int will preserve the
value, so it'll still be a negative subscript. Cast to guchar instead,
to make sure 0x80 through 0xFF are treated as 128 to 255, not -128 to
-1.
Change-Id: I1f0b33ba3686e963d45317b45465ff335431d17f
Reviewed-on: https://code.wireshark.org/review/4742
Reviewed-by: Guy Harris <guy@alum.mit.edu>
C neither guarantees that char is signed nor that it's unsigned. Make
the str_to_nibble tables arrays of gint8, to make sure they can hold
numbers between 0 and 15 as well as -1. Cast gchar to guchar, not int,
when using it as a subscript into that array, so that the subscripts are
in the range 0 to 255, not -128 to 127.
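A minimal sketch of the subscript issue, using standard C types as stand-ins for gint8 and guchar. The table and function names are hypothetical, and the table contents assume an ASCII-compatible character set.

```c
#include <stdint.h>
#include <string.h>

/*
 * Lookup table indexed by byte value. int8_t (stand-in for gint8) is
 * explicitly signed, so it can hold -1 for "not a hex digit" regardless
 * of whether plain char is signed on this platform.
 */
static int8_t nibble_table[256];

/* Hypothetical setup: fill the table with -1, then mark the hex digits. */
static void init_nibble_table(void)
{
    int i;

    memset(nibble_table, -1, sizeof nibble_table);
    for (i = 0; i < 10; i++)
        nibble_table['0' + i] = (int8_t)i;
    for (i = 0; i < 6; i++) {
        nibble_table['a' + i] = (int8_t)(10 + i);
        nibble_table['A' + i] = (int8_t)(10 + i);
    }
}

/*
 * Cast to unsigned char (stand-in for guchar), not int, before indexing:
 * where char is signed, a byte such as 0xE9 is negative, and converting
 * it to int keeps it negative, giving an out-of-bounds subscript.
 */
static int char_to_nibble(char c)
{
    return nibble_table[(unsigned char)c];
}
```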
Change-Id: Ib85de5aa4e83ae9efd808c78ce3f86f45b4a3f2a
Reviewed-on: https://code.wireshark.org/review/4734
Reviewed-by: Guy Harris <guy@alum.mit.edu>
Revert gafa8c02 since it didn't work on Windows. Use a pragma to squelch
Visual C++ instead.
Qt's rich text renderer doesn't handle "&apos;". Replace it with "&#x27;".
Remove a QDebug include.
Change-Id: I0e6308efda74a4bc0e67ce841a50a0a9b68f4a8b
Reviewed-on: https://code.wireshark.org/review/4511
Reviewed-by: Gerald Combs <gerald@wireshark.org>
This commit adds tvb_get_string_bytes and proto_tree_add_bytes_item routines for
getting GByteArrays fields from the tvb when they are encoded in ASCII hex string form.
The proto_tree_add_bytes_item routine is also usable for normal
binary encoded byte arrays, and has the advantage of retrieving
the array values even if there's no proto tree.
It also exposes the routines to Lua, both so that a Lua script can take
advantage of this and so that I can write a test suite to test the functions.
Change-Id: I112a038653df6482a5d0ebe7c95708f207319e20
Reviewed-on: https://code.wireshark.org/review/1158
Reviewed-by: Hadriel Kaplan <hadrielk@yahoo.com>
Reviewed-by: Anders Broman <a.broman58@gmail.com>
(Using sed: sed -i '/^ \* \$Id\$/,+1 d')
Manually fix some typos (in export_object_dicom.c and crc16-plain.c)
Change-Id: I4c1ae68d1c4afeace8cb195b53c715cf9e1227a8
Reviewed-on: https://code.wireshark.org/review/497
Reviewed-by: Anders Broman <a.broman58@gmail.com>