38 Commits

Author SHA1 Message Date
Bernhard Dick 75fb2e770c DECT-NWK: Add basic support for DECT charsets 2022-12-21 21:30:20 +00:00
John Thacker 7a4d05d63a charsets: Don't add illegal Unicode codepoints for UTF-16, UTF-32
If a character is not a valid Unicode codepoint, i.e. one of
the code points reserved for surrogate pairs or a code point
above 0x10FFFF, don't add it to a wmem_strbuf when converting
from other encodings but add a replacement character instead, by
using a new wmem_strbuf_append_unichar_validated() function.

Now we produce valid UTF-8 in various situations where UCS-2 or UTF-32
can encode unpaired surrogate codepoints. Consolidate some related
checks that are now redundant.

Also add a replacement character to the end of invalid UCS-2 strings
with an odd number of bytes, as done with UTF-16 and UTF-32.

Fix #18508
2022-10-19 07:53:02 -04:00
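
The validation rule described in the commit above can be sketched in C as follows. This is only an illustration: `is_unicode_scalar()` and `utf8_put_validated()` are hypothetical names, not the real `wmem_strbuf_append_unichar_validated()` implementation, and the sketch writes into a caller-supplied 4-byte buffer rather than a wmem_strbuf.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Nonzero if cp is a Unicode scalar value: not above 0x10FFFF and not
 * in the surrogate range 0xD800..0xDFFF. */
static int is_unicode_scalar(uint32_t cp)
{
    return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

/* Encode cp as UTF-8 into out (room for 4 bytes required).
 * Invalid input is replaced by U+FFFD. Returns the bytes written. */
static size_t utf8_put_validated(uint32_t cp, uint8_t *out)
{
    if (!is_unicode_scalar(cp))
        cp = REPLACEMENT_CHAR;

    if (cp < 0x80u) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (uint8_t)(0xC0u | (cp >> 6));
        out[1] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        out[0] = (uint8_t)(0xE0u | (cp >> 12));
        out[1] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    out[0] = (uint8_t)(0xF0u | (cp >> 18));
    out[1] = (uint8_t)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (uint8_t)(0x80u | (cp & 0x3Fu));
    return 4;
}
```

With this sketch, an unpaired surrogate such as 0xD800 or an out-of-range value such as 0x110000 both come out as the three bytes EF BF BD, the UTF-8 form of U+FFFD.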
John Thacker 5bc8cac5cc charsets: UCS-4 code points above 0x10FFFF are not legal
When decoding UCS-4/UTF-32, map Unicode code points above
0x10FFFF to REPLACEMENT CHARACTER, as they are not legal,
and would create invalid UTF-8.
Also if the number of bytes given is not a multiple of 4,
insert a replacement character at the end as well.

These are two long-standing TODOs. Fixes #18435.
2022-10-11 20:40:09 -04:00
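
A minimal sketch of the two rules in the commit above, assuming big-endian input and a hypothetical helper name `decode_ucs4_be()`. It emits code points rather than UTF-8 and deliberately covers only what this commit describes (out-of-range values and a truncated tail), not the later surrogate check.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode big-endian UCS-4 into code points.
 * out must have room for len / 4 + 1 entries.
 * Returns the number of code points written. */
static size_t decode_ucs4_be(const uint8_t *in, size_t len, uint32_t *out)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i + 4 <= len; i += 4) {
        uint32_t cp = ((uint32_t)in[i] << 24) |
                      ((uint32_t)in[i + 1] << 16) |
                      ((uint32_t)in[i + 2] << 8) |
                      (uint32_t)in[i + 3];
        if (cp > 0x10FFFFu)
            cp = 0xFFFDu;       /* above the Unicode range */
        out[n++] = cp;
    }
    if (i < len)
        out[n++] = 0xFFFDu;     /* 1 to 3 leftover bytes at the end */
    return n;
}
```

Emitting a single replacement character for the leftover bytes, rather than one per byte, matches the commit's wording ("a replacement character at the end").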
João Valverde 9ab1f35641 Move print_hex_data_buffer() to wsutil
Move this generic function to wsutil so it can be used
by other libraries.
2022-10-08 12:39:04 +01:00
Moshe Kaplan 1c3a9af869 Add files with WS_DLL_PUBLIC to Doxygen
Add @file markers for most files that
contain functions exported with
WS_DLL_PUBLIC so that Doxygen will
generate documentation for them.
2021-11-29 21:27:45 +00:00
John Thacker e20bd408de Use iconv to support GB 18030 and EUC-KR, allow future encodings
Add internal support for using iconv (always present with glib) to convert
strings from various encodings to UTF-8 (using REPLACEMENT CHARACTER as
recommended), and use that to support GB 18030 and EUC-KR. Replace the direct
call to iconv in ANSI 637 for EUC-KR with the new API. Update comments
and documentation around character encodings. It is possible to replace
the calls to iconv with an internal decoder later. Tested on Linux and
on Windows (including with illegal characters). Closes #16630.
2020-10-21 11:26:23 +00:00
John Thacker 91b792c6dc Replace ill-formed UTF-8 byte sequences with replacement character
Implement the Unicode Standard "best practices" for replacing ill-formed
sequences with the Unicode REPLACEMENT CHARACTER. Add wmem_strbuf_append_len
for appending strings with embedded null characters. Clarify why
wmem_strbuf_grow() doesn't always ensure that there's enough room for
a new string, and short-circuit some tests there. Related to #14948
2020-10-15 21:48:28 +00:00
Guy Harris c597927da8 Add some more string encodings.
Add an encoding for "unpacked" 3GPP TS 23.038 7-bit strings, in which
each code position is in a byte of its own, rather than with the code
positions packed into 7 bits.  Rename the packed encoding to explicitly
indicate that it's packed.

Add an encoding for ETSI TS 102 221 Annex A strings.

Use the new encodings.
2020-09-28 22:30:35 +00:00
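
The difference between the packed and unpacked forms mentioned in the commit above can be shown with a short sketch. `unpack_septets()` is a hypothetical helper, not a Wireshark function. It assumes the usual 3GPP TS 23.038 packing, with the first septet in the low bits of the first octet, and it produces one code position per byte, which is the "unpacked" layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand n_septets 7-bit code positions from a packed buffer into one
 * byte each. packed must hold at least (n_septets * 7 + 7) / 8 bytes.
 * The output values are GSM code positions, not Unicode. */
static size_t unpack_septets(const uint8_t *packed, size_t n_septets,
                             uint8_t *out)
{
    for (size_t i = 0; i < n_septets; i++) {
        size_t bit = i * 7;             /* bit offset of this septet */
        size_t byte = bit / 8;
        unsigned shift = (unsigned)(bit % 8);
        uint16_t w = packed[byte];

        /* With a shift of 2 or more the septet spills into the next octet. */
        if (shift > 1)
            w |= (uint16_t)((uint16_t)packed[byte + 1] << 8);
        out[i] = (uint8_t)((w >> shift) & 0x7Fu);
    }
    return n_septets;
}
```

For example, the five packed octets E8 32 9B FD 06 expand to the code positions 68 65 6C 6C 6F, which the GSM default alphabet maps to "hello".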
Guy Harris 20800366dd HTTPS (almost) everywhere.
Change all wireshark.org URLs to use https.

Fix some broken links while we're at it.

Change-Id: I161bf8eeca43b8027605acea666032da86f5ea1c
Reviewed-on: https://code.wireshark.org/review/34089
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2019-07-26 18:44:40 +00:00
Guy Harris e26e0b4de0 Add support for the ISO 646 "Basic code table" encoding.
The "Basic code table" in ISO 646 is mostly ASCII, but some code points
either 1) have more than one glyph that can be assigned to them or 2)
have no glyph assigned to them.  National versions choose one of the two
glyphs for the code points in group 1) and assign specific glyphs to the
code points in group 2); the International Reference Version assigns the
same glyphs to those code points as does ASCII.

For the "Basic code table" encoding, we map the code points in groups 1)
and 2) to a REPLACEMENT CHARACTER; additional encodings can be added for
the national versions.

Add ENC_ISO_646_IRV (International Reference Version) as an alias for
ENC_ASCII.

Expand some comments, and add some comments, while we're at it.

Change-Id: I4f1b5e426ec193775e919731c5cae1224dc65115
Reviewed-on: https://code.wireshark.org/review/33941
Petri-Dish: Guy Harris <guy@alum.mit.edu>
Tested-by: Petri Dish Buildbot
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2019-07-15 07:50:30 +00:00
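
A sketch of the "Basic code table" mapping described in the commit above. The function name is hypothetical, and the list of variable positions comes from my reading of ISO 646 rather than from the commit, so treat it as an assumption to check against the standard.

```c
#include <stdint.h>

/* Map one ISO 646 "Basic code table" byte to a Unicode code point.
 * Positions whose glyph depends on the national version map to U+FFFD.
 * Bytes above 0x7F are outside the 7-bit code and also map to U+FFFD. */
static uint32_t iso646_basic_to_unicode(uint8_t c)
{
    switch (c) {
    /* Group 1: two alternative glyphs (assumed: 0x23 and 0x24). */
    case 0x23: case 0x24:
    /* Group 2: no glyph assigned, left for national use (assumed list). */
    case 0x40:
    case 0x5B: case 0x5C: case 0x5D: case 0x5E:
    case 0x60:
    case 0x7B: case 0x7C: case 0x7D: case 0x7E:
        return 0xFFFDu;
    default:
        return c < 0x80 ? (uint32_t)c : 0xFFFDu;
    }
}
```

Under ENC_ISO_646_IRV, which the commit defines as an alias for ENC_ASCII, these same positions would instead keep their ASCII glyphs.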
Guy Harris 03c5da8d89 Add Windows code page 1252.
While we're at it, add the Euro to code page 1251, expand the comments
for 1250 and 1251 and some DOS code pages, and add support for code page
1251 to tvb_get_stringz_enc().

Change-Id: I053d58f87cac26ad7c109e2f1cd8807ffec0622d
Reviewed-on: https://code.wireshark.org/review/33342
Petri-Dish: Guy Harris <guy@alum.mit.edu>
Tested-by: Petri Dish Buildbot
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2019-05-25 01:07:36 +00:00
kanidef 5fa9257704 add encoding windows 1251, cp855, cp866
Change-Id: I0e8507cf63d89942167ca579ef304bc3d679346e
Reviewed-on: https://code.wireshark.org/review/31316
Petri-Dish: Peter Wu <peter@lekensteyn.nl>
Tested-by: Petri Dish Buildbot
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2019-01-04 23:37:17 +00:00
Dario Lombardo 55c68ee69c epan: use SPDX identifiers.
Skipping dissectors dir for now.

Change-Id: I717b66bfbc7cc81b83f8c2cbc011fcad643796aa
Reviewed-on: https://code.wireshark.org/review/25694
Petri-Dish: Dario Lombardo <lomato@gmail.com>
Tested-by: Petri Dish Buildbot
Reviewed-by: Anders Broman <a.broman58@gmail.com>
2018-02-08 19:29:45 +00:00
Guy Harris b604fff136 Rename non-EBCDIC-specific routines.
Those routines can handle any single-byte character set whose characters
map to characters in the Basic Multilingual Plane; they could be used for
extended ASCII, but we have another routine for that, mapping only
characters with code points > 0x7f, so we just say "nonascii" rather
than "ebcdic".

Change-Id: I3d55b5d58e3e7ab08f3dfbfdb57a0301a30e71d4
Reviewed-on: https://code.wireshark.org/review/19214
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2016-12-12 08:20:22 +00:00
Guy Harris 4d47c9a841 Fix handling of EBCDIC string fields.
Have a routine that takes a 256-element translation table and uses it to
map various flavors of EBCDIC to Unicode.  Have separate translation
tables for "common" EBCDIC (everything that's the same in all EBCDIC
code pages that include the original EBCDIC characters) and EBCDIC code
page 037.  Add ENC_EBCDIC_CP037 for code page 037.

Change-Id: Ia882b3c0abef9e30eb54cd47396e6fa0d6342044
Reviewed-on: https://code.wireshark.org/review/19212
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2016-12-12 05:49:50 +00:00
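
The table-driven approach in the commit above might look roughly like this. Both function names are hypothetical, and the demo table fills in only a handful of EBCDIC positions (space, the digits, and the letter A); a real table would define all 256 entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a tiny demonstration table. Every entry starts as U+FFFD, then
 * a few well-known EBCDIC positions are filled in. */
static void init_demo_ebcdic_table(uint16_t table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = 0xFFFDu;
    table[0x40] = 0x0020u;                      /* space */
    table[0xC1] = 0x0041u;                      /* 'A' */
    for (int d = 0; d < 10; d++)
        table[0xF0 + d] = (uint16_t)('0' + d);  /* digits 0 to 9 */
}

/* Translate len bytes through a 256-entry table of BMP code points.
 * out must have room for len entries. Returns len. */
static size_t decode_with_table(const uint8_t *in, size_t len,
                                const uint16_t table[256], uint16_t *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = table[in[i]];
    return len;
}
```

Because the decoder only indexes a table, supporting another code page such as 037 is a matter of supplying a different 256-entry table, which is the design the commit describes.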
Pascal Quantin 321b756dc4 Add T.61 character set support
Bug: 13032
Change-Id: I6bf2cc2c43a6262d899a304df6576d9831115966
Reviewed-on: https://code.wireshark.org/review/18350
Petri-Dish: Michael Mann <mmann78@netscape.net>
Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
Reviewed-by: Michael Mann <mmann78@netscape.net>
2016-10-22 03:16:11 +00:00
Bill Meier f3dd7fe1eb Fix whitespace/indentation to match editor modelines.
Change-Id: I3445ae22f10584582d465bf632942e016f5f70ca
Reviewed-on: https://code.wireshark.org/review/3452
Reviewed-by: Bill Meier <wmeier@newsguy.com>
2014-08-05 20:42:21 +00:00
Guy Harris 29eba5308f Add a get_ebcdic_string() routine, similar to other get_XXX_string() routines.
Use it in epan/tvbuff.c.

Do some other cleanups while we're at it.

Change-Id: I7aed37a568373b896aacfd23f986d445b58b77b7
Reviewed-on: https://code.wireshark.org/review/1342
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-25 09:30:14 +00:00
Guy Harris 0d787afcb4 Another whitespace cleanup.
Change-Id: I7c5c557730fb59244bc82c35fcf79c40991d4d99
Reviewed-on: https://code.wireshark.org/review/1341
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-25 08:44:36 +00:00
Guy Harris 6a9c924460 Move the XXX-to-UTF-8 loops to routines in epan/charsets.c.
This moves a bunch of character set knowledge into epan/charsets.c.

Change-Id: Ieb79dcaac9753c77703af756b666ad2ca9385d9e
Reviewed-on: https://code.wireshark.org/review/1339
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-25 08:32:06 +00:00
Jakub Zawadzki 4bd8336017 Move GSM guint8 to unicode conversion functions to charsets.c
charsets.c is already the place with a huge number of conversion tables.
Also make gsm_default_alphabet gunichar2, as all values fit in 2 bytes.

Change-Id: Ia5ab6c176b4fec21ec76b06513c1d00794ba10ef
Reviewed-on: https://code.wireshark.org/review/1328
Reviewed-by: Anders Broman <a.broman58@gmail.com>
2014-04-25 04:17:58 +00:00
Guy Harris ae127f23fa Add Mac Roman and DOS CP437.
Change-Id: Ib96f2cf4ea71cd0cc2c703d58b9d254bf4c1248a
Reviewed-on: https://code.wireshark.org/review/1077
Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-12 08:54:06 +00:00
Alexis La Goutte 296591399f Remove all $Id$ from top of file
(Using sed : sed -i '/^ \* \$Id\$/,+1 d')

Manually fix some typos (in export_object_dicom.c and crc16-plain.c)

Change-Id: I4c1ae68d1c4afeace8cb195b53c715cf9e1227a8
Reviewed-on: https://code.wireshark.org/review/497
Reviewed-by: Anders Broman <a.broman58@gmail.com>
2014-03-04 14:27:33 +00:00
Guy Harris f231a273f2 Add the rest of ISO-8859-n, thanks to Jakub's "generate a mapping table"
program.

Put the character-encoding cases in order.

svn path=/trunk/; revision=54344
2013-12-21 21:55:46 +00:00
Jakub Zawadzki 099294dd16 Add charset table for ISO/IEC 8859-9 (ENC_ISO_8859_9)
svn path=/trunk/; revision=54239
2013-12-18 23:32:06 +00:00
Martin Kaiser a07c0ff146 add support for ISO 8859-5
svn path=/trunk/; revision=54132
2013-12-15 19:13:31 +00:00
Martin Kaiser db1b70f168 as requested, move the functions/defines for DVB character tables
to separate files

svn path=/trunk/; revision=54113
2013-12-15 12:05:50 +00:00
Jakub Zawadzki d6da7a01b1 Fix warnings + remove some v. old comment from strutil.h
svn path=/trunk/; revision=54078
2013-12-13 23:11:14 +00:00
Martin Kaiser 20c7414c71 use large positive values for illegal DVB-SI string encodings
interpret encoding fields as UINT32 so that the displayed value matches
the actual bytes in the packet

svn path=/trunk/; revision=53927
2013-12-10 22:08:07 +00:00
Martin Kaiser 3dbf837040 add editor modelines
svn path=/trunk/; revision=53890
2013-12-09 20:58:57 +00:00
Martin Kaiser cb1cb946d3 From Jakub
support DVB-SI character tables (EN 300 468) in a generic way

From me
move things to charsets.c/.h
distinguish between single and multi byte encoding for some tables
(so that the highlighted bytes match the displayed value)
no character table byte -> length 0, use default table

svn path=/trunk/; revision=53886
2013-12-09 20:46:27 +00:00
Guy Harris 3c2bd00ccf Note what the two new character encoding tables in charsets.c are.
svn path=/trunk/; revision=53833
2013-12-07 22:45:37 +00:00
Jakub Zawadzki 0e5bc8a49c Add string encoding for ISO/IEC 8859-2 (ENC_ISO_8859_2)
svn path=/trunk/; revision=53826
2013-12-07 15:02:55 +00:00
Jakub Zawadzki 113b078a4d Add new string proto encoding for windows-1250 (ENC_WINDOWS_1250)
- Move windows-1250 to unicode encoding table to charsets.c
- Add tvb_get_string_unichar2, tvb_get_stringz_unichar2 functions which recode tvb-string to UTF-8.

svn path=/trunk/; revision=53819
2013-12-07 10:10:03 +00:00
Balint Reczey 1ebdb2e521 Export libwireshark symbols using WS_DLL_PUBLIC define
Also remove old WS_VAR_IMPORT define and related Makefile magic
everywhere in the project.

svn path=/trunk/; revision=47992
2013-03-01 23:53:11 +00:00
Jakub Zawadzki bf81b42e1e Update Free Software Foundation address.
(COPYING will be updated in next commit)

svn path=/trunk/; revision=43536
2012-06-28 22:56:06 +00:00
Ronnie Sahlberg 89f022b12b name change
svn path=/trunk/; revision=18197
2006-05-21 05:12:17 +00:00
Guy Harris ac982aa7a5 Move the stuff to handle ASCII <-> EBCDIC conversions to
"epan/charsets.c"; other character set translation code should perhaps
go there as well.

svn path=/trunk/; revision=11958
2004-09-10 22:59:37 +00:00