path: root/epan/charsets.h
Age | Commit message | Author | Files | Lines
2023-06-08 | epan: Add ENC_BOM modifier for UTF-16, UCS-2, UCS-4 | John Thacker | 1 | -7/+10
Add ENC_BOM to the list of bitflag modifiers, and use it with UTF-16, UCS-2, and UCS-4 (UTF-32). If set, the first 2 (or 4) octets, if present, are checked to see if they are a BYTE ORDER MARK (U+FEFF, "ZERO WIDTH NO-BREAK SPACE"). If so, those octets are skipped and the encoding is set to Little-Endian or Big-Endian depending on the endianness of the BOM. If the BOM is absent, the passed-in endianness flag is used normally. Related to #17991
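The BOM check described above can be sketched roughly as follows for the 2-octet (UTF-16/UCS-2) case. This is only an illustration under stated assumptions: the names `sketch_endian` and `sketch_utf16_check_bom` are hypothetical and are not part of the Wireshark API, and the real code works on tvbuffs and ENC_ flags rather than raw pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not a Wireshark definition. */
typedef enum { SKETCH_BIG_ENDIAN, SKETCH_LITTLE_ENDIAN } sketch_endian;

/*
 * Look at the first two octets of a UTF-16/UCS-2 buffer.
 * Returns the number of octets to skip (0 or 2) and stores in *order
 * the byte order to decode with: the BOM's order if a BOM is present,
 * otherwise the caller-supplied fallback.
 */
static size_t
sketch_utf16_check_bom(const uint8_t *buf, size_t len,
                       sketch_endian fallback, sketch_endian *order)
{
    assert(buf != NULL || len == 0);
    assert(order != NULL);

    *order = fallback;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            /* U+FEFF stored big-endian */
            *order = SKETCH_BIG_ENDIAN;
            return 2;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            /* U+FEFF stored little-endian */
            *order = SKETCH_LITTLE_ENDIAN;
            return 2;
        }
    }
    return 0; /* no BOM: keep the fallback byte order */
}
```

A caller would skip the returned number of octets and decode the rest of the buffer using `*order`.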
2023-04-26 | epan: Implement EBCDIC CP 500, for DRDA | John Thacker | 1 | -0/+1
EBCDIC Code Page 500 has exactly the same repertoire as CP 037, covering all of ISO-8859-1, but has 7 bytes permuted. It is the default code page for DRDA; use it there.
2022-12-21 | DECT-NWK: Add basic support for DECT charsets | Bernhard Dick | 1 | -0/+2
2022-10-19 | charsets: Don't add illegal Unicode codepoints for UTF-16, UTF-32 | John Thacker | 1 | -13/+0
If a character is not a valid Unicode codepoint, i.e. one of the code points reserved for surrogate pairs or a code point above 0x10FFFF, don't add it to a wmem_strbuf when converting from other encodings but add a replacement character instead, by using a new wmem_strbuf_append_unichar_validated() function. Now we produce valid UTF-8 in various situations where UCS-2 or UTF-32 can encode unpaired surrogate codepoints. Consolidate some related checks that are now redundant. Also add a replacement character to the end of invalid UCS-2 strings with an odd number of bytes, as done with UTF-16 and UTF-32. Fix #18508
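The validation rule in this message can be illustrated with a small sketch. It is not the actual `wmem_strbuf_append_unichar_validated()` implementation: the `sketch_` names are hypothetical, and the output here is a plain caller-supplied byte buffer rather than a wmem_strbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT_CHAR 0xFFFDu /* U+FFFD REPLACEMENT CHARACTER */

/* Map surrogate code points and values above 0x10FFFF to U+FFFD. */
static uint32_t
sketch_validate_codepoint(uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return SKETCH_REPLACEMENT_CHAR;
    return cp;
}

/*
 * Validate cp, then encode it as UTF-8 into out, which must have
 * room for at least 4 bytes. Returns the number of bytes written.
 */
static size_t
sketch_append_unichar_validated(uint8_t *out, uint32_t cp)
{
    assert(out != NULL);

    cp = sketch_validate_codepoint(cp);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

Because validation happens before encoding, an unpaired surrogate such as 0xD800 comes out as the three bytes EF BF BD, so the result is always well-formed UTF-8.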
2022-10-11 | charsets: UCS-4 code points above 0x10FFFF are not legal | John Thacker | 1 | -3/+0
When decoding UCS-4/UTF-32, map Unicode code points above 0x10FFFF to REPLACEMENT CHARACTER, as they are not legal and would create invalid UTF-8. Also, if the number of bytes given is not a multiple of 4, insert a replacement character at the end. These are two long-standing TODOs. Fixes #18435.
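The two rules in this message (out-of-range values and a truncated final unit) might look like this in a stand-alone decoder. This is a sketch only: `sketch_ucs4_be_to_codepoints` is a hypothetical name, it handles big-endian input only, and it emits code points into a caller-supplied array instead of building a UTF-8 string as the real code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode big-endian UCS-4 octets into code points.
 * - Values above 0x10FFFF become U+FFFD.
 * - If len is not a multiple of 4, the leftover octets produce one
 *   trailing U+FFFD.
 * Writes at most out_cap code points and returns how many were written.
 */
static size_t
sketch_ucs4_be_to_codepoints(const uint8_t *in, size_t len,
                             uint32_t *out, size_t out_cap)
{
    size_t i, n = 0;

    assert(in != NULL || len == 0);
    assert(out != NULL || out_cap == 0);

    for (i = 0; i + 4 <= len && n < out_cap; i += 4) {
        uint32_t cp = ((uint32_t)in[i] << 24) |
                      ((uint32_t)in[i + 1] << 16) |
                      ((uint32_t)in[i + 2] << 8) |
                      (uint32_t)in[i + 3];
        out[n++] = (cp > 0x10FFFF) ? 0xFFFDu : cp;
    }
    /* 1 to 3 leftover octets cannot form a code point. */
    if (i < len && len - i < 4 && n < out_cap)
        out[n++] = 0xFFFDu;
    return n;
}
```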
2022-10-08 | Move print_hex_data_buffer() to wsutil | João Valverde | 1 | -9/+0
Move this generic function to wsutil so it can be used by other libraries.
2021-11-29 | Add files with WS_DLL_PUBLIC to Doxygen | Moshe Kaplan | 1 | -1/+1
Add @file markers for most files that contain functions exported with WS_DLL_PUBLIC so that Doxygen will generate documentation for them.
2020-10-21 | Use iconv to support GB 18030 and EUC-KR, allow future encodings | John Thacker | 1 | -0/+24
Add internal support for using iconv (always present with glib) to convert strings from various encodings to UTF-8 (using REPLACEMENT CHARACTER as recommended), and use that to support GB 18030 and EUC-KR. Replace the direct call to iconv in ANSI 637 for EUC-KR with the new API. Update comments and documentation around character encodings. It is possible to replace the calls to iconv with an internal decoder later. Tested on Linux and on Windows (including with illegal characters). Closes #16630.
2020-10-15 | Replace ill-formed UTF-8 byte sequences with replacement character | John Thacker | 1 | -0/+11
Implement the Unicode Standard "best practices" for replacing ill-formed sequences with the Unicode REPLACEMENT CHARACTER. Add wmem_strbuf_append_len for appending strings with embedded null characters. Clarify why wmem_strbuf_grow() doesn't always ensure that there's enough room for a new string, and short-circuit some tests there. Related to #14948
2020-09-28 | Add some more string encodings. | Guy Harris | 1 | -1/+9
Add an encoding for "unpacked" 3GPP TS 23.038 7-bit strings, in which each code position is in a byte of its own, rather than with the code positions packed into 7 bits. Rename the packed encoding to explicitly indicate that it's packed. Add an encoding for ETSI TS 102 221 Annex A strings. Use the new encodings.
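To make the packed/unpacked distinction concrete, here is a rough sketch of turning packed 7-bit data into the unpacked form (one septet per byte). The function name is hypothetical and this is not the Wireshark routine. It assumes the usual 3GPP TS 23.038 packing, in which the first septet occupies the low 7 bits of the first octet. The output values are GSM default-alphabet positions, not Unicode, so a further table lookup would still be needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Unpack up to n_septets 7-bit code positions from in[0..in_len) into
 * out, one septet per output byte. Returns the number of septets
 * produced, which may be fewer than n_septets if the input runs out.
 */
static size_t
sketch_unpack_septets(const uint8_t *in, size_t in_len,
                      size_t n_septets, uint8_t *out)
{
    size_t i;

    assert(in != NULL || in_len == 0);
    assert(out != NULL || n_septets == 0);

    for (i = 0; i < n_septets; i++) {
        size_t bit = i * 7;        /* position of the septet's low bit */
        size_t byte = bit / 8;
        unsigned shift = (unsigned)(bit % 8);
        uint16_t window;

        if (byte >= in_len)
            break;
        /* A septet can straddle two octets, so read a 16-bit window. */
        window = in[byte];
        if (byte + 1 < in_len)
            window |= (uint16_t)((uint16_t)in[byte + 1] << 8);
        out[i] = (uint8_t)((window >> shift) & 0x7F);
    }
    return i;
}
```

For the lowercase letters used in a string such as "hello", the GSM default alphabet positions coincide with ASCII, which makes that string a convenient manual check.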
2019-07-26 | HTTPS (almost) everywhere. | Guy Harris | 1 | -1/+1
Change all wireshark.org URLs to use https. Fix some broken links while we're at it. Change-Id: I161bf8eeca43b8027605acea666032da86f5ea1c Reviewed-on: https://code.wireshark.org/review/34089 Reviewed-by: Guy Harris <guy@alum.mit.edu>
2019-07-15 | Add support for the ISO 646 "Basic code table" encoding. | Guy Harris | 1 | -0/+88
The "Basic code table" in ISO 646 is mostly ASCII, but some code points either 1) have more than one glyph that can be assigned to them or 2) have no glyph assigned to them. National versions choose one of the two glyphs for the code points in group 1) and assign specific glyphs to the code points in group 2); the International Reference Version assigns the same glyphs to those code points as does ASCII. For the "Basic code table" encoding, we map the code points in groups 1) and 2) to a REPLACEMENT CHARACTER; additional encodings can be added for the national versions. Add ENC_ISO_646_IRV (International Reference Version) as an alias for ENC_ASCII. Expand some comments, and add some comments, while we're at it. Change-Id: I4f1b5e426ec193775e919731c5cae1224dc65115 Reviewed-on: https://code.wireshark.org/review/33941 Petri-Dish: Guy Harris <guy@alum.mit.edu> Tested-by: Petri Dish Buildbot Reviewed-by: Guy Harris <guy@alum.mit.edu>
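The mapping described here can be sketched as a small lookup. The list of affected positions below is an assumption based on a common reading of ISO 646 (0x23 and 0x24 as the two-glyph positions, and ten further positions reserved for national use); it is not copied from the Wireshark table, and the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Return 1 if c is one of the code points that, in the ISO 646 basic
 * code table, either has two candidate glyphs or has no glyph assigned.
 * ASSUMPTION: this list is illustrative and should be checked against
 * the standard before being relied on.
 */
static int
sketch_iso646_is_variant_position(uint8_t c)
{
    switch (c) {
    case 0x23: case 0x24:                         /* two candidate glyphs */
    case 0x40:
    case 0x5B: case 0x5C: case 0x5D: case 0x5E:
    case 0x60:
    case 0x7B: case 0x7C: case 0x7D: case 0x7E:   /* national use */
        return 1;
    default:
        return 0;
    }
}

/*
 * Map one octet of "basic code table" text to a Unicode code point.
 * Variant positions and octets outside the 7-bit range become U+FFFD;
 * everything else is the same as ASCII.
 */
static uint32_t
sketch_iso646_basic_to_unicode(uint8_t c)
{
    if (c >= 0x80 || sketch_iso646_is_variant_position(c))
        return 0xFFFDu;
    return c;
}
```

A national version would replace the U+FFFD results with its own glyph choices, while the International Reference Version would return the ASCII value for every 7-bit octet.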
2019-05-25 | Add Windows code page 1252. | Guy Harris | 1 | -0/+2
While we're at it, add the Euro to code page 1251, expand the comments for 1250 and 1251 and some DOS code pages, and add support for code page 1251 to tvb_get_stringz_enc(). Change-Id: I053d58f87cac26ad7c109e2f1cd8807ffec0622d Reviewed-on: https://code.wireshark.org/review/33342 Petri-Dish: Guy Harris <guy@alum.mit.edu> Tested-by: Petri Dish Buildbot Reviewed-by: Guy Harris <guy@alum.mit.edu>
2019-01-04 | add encoding windows 1251, cp855, cp866 | kanidef | 1 | -0/+5
Change-Id: I0e8507cf63d89942167ca579ef304bc3d679346e Reviewed-on: https://code.wireshark.org/review/31316 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot Reviewed-by: Guy Harris <guy@alum.mit.edu>
2018-02-08 | epan: use SPDX identifiers. | Dario Lombardo | 1 | -13/+1
Skipping dissectors dir for now. Change-Id: I717b66bfbc7cc81b83f8c2cbc011fcad643796aa Reviewed-on: https://code.wireshark.org/review/25694 Petri-Dish: Dario Lombardo <lomato@gmail.com> Tested-by: Petri Dish Buildbot Reviewed-by: Anders Broman <a.broman58@gmail.com>
2016-12-12 | Rename non-EBCDIC-specific routines. | Guy Harris | 1 | -1/+1
Those routines can handle any single-byte character set whose characters map to characters in the Basic Multilingual Plane; they could be used for extended ASCII, but we have another routine for that, mapping only characters with code points > 0x7f, so we just say "nonascii" rather than "ebcdic". Change-Id: I3d55b5d58e3e7ab08f3dfbfdb57a0301a30e71d4 Reviewed-on: https://code.wireshark.org/review/19214 Reviewed-by: Guy Harris <guy@alum.mit.edu>
2016-12-12 | Fix handling of EBCDIC string fields. | Guy Harris | 1 | -1/+5
Have a routine that takes a 256-element translation table and uses it to map various flavors of EBCDIC to Unicode. Have separate translation tables for "common" EBCDIC (everything that's the same in all EBCDIC code pages that include the original EBCDIC characters) and EBCDIC code page 037. Add ENC_EBCDIC_CP037 for code page 037. Change-Id: Ia882b3c0abef9e30eb54cd47396e6fa0d6342044 Reviewed-on: https://code.wireshark.org/review/19212 Reviewed-by: Guy Harris <guy@alum.mit.edu>
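The table-driven approach can be sketched as below. The names are hypothetical and the table is deliberately partial: only the space, the upper-case letters and the digits are filled in (positions that are the same across EBCDIC code pages), with every other entry set to U+FFFD. A real table would define all 256 entries, and the real routine produces UTF-8 rather than an array of 16-bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill a 256-entry table with a small, partial EBCDIC-to-Unicode
 * mapping for illustration. Unlisted entries map to U+FFFD.
 */
static void
sketch_fill_partial_ebcdic(uint16_t table[256])
{
    unsigned i;

    for (i = 0; i < 256; i++)
        table[i] = 0xFFFD;
    table[0x40] = 0x0020;                          /* space */
    for (i = 0; i < 9; i++) {
        table[0xC1 + i] = (uint16_t)(0x41 + i);    /* A..I */
        table[0xD1 + i] = (uint16_t)(0x4A + i);    /* J..R */
    }
    for (i = 0; i < 8; i++)
        table[0xE2 + i] = (uint16_t)(0x53 + i);    /* S..Z */
    for (i = 0; i < 10; i++)
        table[0xF0 + i] = (uint16_t)(0x30 + i);    /* 0..9 */
}

/*
 * Translate len octets through the table, one output value per input
 * octet. The same loop serves any single-byte character set whose
 * characters all lie in the Basic Multilingual Plane.
 */
static void
sketch_map_single_byte(const uint8_t *in, size_t len,
                       const uint16_t table[256], uint16_t *out)
{
    size_t i;

    assert(in != NULL || len == 0);
    assert(table != NULL);
    assert(out != NULL || len == 0);

    for (i = 0; i < len; i++)
        out[i] = table[in[i]];
}
```

Supporting another code page then only means supplying a different 256-entry table; the translation loop stays the same.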
2016-10-22 | Add T.61 character set support | Pascal Quantin | 1 | -0/+3
Bug: 13032 Change-Id: I6bf2cc2c43a6262d899a304df6576d9831115966 Reviewed-on: https://code.wireshark.org/review/18350 Petri-Dish: Michael Mann <mmann78@netscape.net> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Michael Mann <mmann78@netscape.net>
2014-08-05 | Fix whitespace/indentation to match editor modelines. | Bill Meier | 1 | -2/+2
Change-Id: I3445ae22f10584582d465bf632942e016f5f70ca Reviewed-on: https://code.wireshark.org/review/3452 Reviewed-by: Bill Meier <wmeier@newsguy.com>
2014-04-25 | Add a get_ebcdic_string() routine, similar to other get_XXX_string() routines. | Guy Harris | 1 | -9/+12
Use it in epan/tvbuff.c. Do some other cleanups while we're at it. Change-Id: I7aed37a568373b896aacfd23f986d445b58b77b7 Reviewed-on: https://code.wireshark.org/review/1342 Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-25 | Another whitespace cleanup. | Guy Harris | 1 | -1/+1
Change-Id: I7c5c557730fb59244bc82c35fcf79c40991d4d99 Reviewed-on: https://code.wireshark.org/review/1341 Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-25 | Move the XXX-to-UTF-8 loops to routines in epan/charsets.c. | Guy Harris | 1 | -6/+35
This moves a bunch of character set knowledge into epan/charsets.c. Change-Id: Ieb79dcaac9753c77703af756b666ad2ca9385d9e Reviewed-on: https://code.wireshark.org/review/1339 Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-04-25 | Move GSM guint8 to unicode conversion functions to charsets.c | Jakub Zawadzki | 1 | -0/+6
charsets.c already holds a large number of conversion tables. Also make gsm_default_alphabet gunichar2, as all its values fit in 2 bytes. Change-Id: Ia5ab6c176b4fec21ec76b06513c1d00794ba10ef Reviewed-on: https://code.wireshark.org/review/1328 Reviewed-by: Anders Broman <a.broman58@gmail.com>
2014-04-12 | Add Mac Roman and DOS CP437. | Guy Harris | 1 | -0/+6
Change-Id: Ib96f2cf4ea71cd0cc2c703d58b9d254bf4c1248a Reviewed-on: https://code.wireshark.org/review/1077 Reviewed-by: Guy Harris <guy@alum.mit.edu>
2014-03-04 | Remove all $Id$ from top of file | Alexis La Goutte | 1 | -2/+0
(Using sed: sed -i '/^ \* \$Id\$/,+1 d') Manually fix some typos (in export_object_dicom.c and crc16-plain.c). Change-Id: I4c1ae68d1c4afeace8cb195b53c715cf9e1227a8 Reviewed-on: https://code.wireshark.org/review/497 Reviewed-by: Anders Broman <a.broman58@gmail.com>
2013-12-21 | Add the rest of ISO-8859-n, thanks to Jakub's "generate a mapping table" program. | Guy Harris | 1 | -0/+11
Put the character-encoding cases in order. svn path=/trunk/; revision=54344
2013-12-18 | Add charset table for ISO/IEC 8859-9 (ENC_ISO_8859_9) | Jakub Zawadzki | 1 | -0/+1
svn path=/trunk/; revision=54239
2013-12-15 | add support for ISO 8859-5 | Martin Kaiser | 1 | -0/+1
svn path=/trunk/; revision=54132
2013-12-15 | as requested, move the functions/defines for DVB character tables to separate files | Martin Kaiser | 1 | -52/+0
svn path=/trunk/; revision=54113
2013-12-13 | Fix warnings + remove some v. old comment from strutil.h | Jakub Zawadzki | 1 | -3/+3
svn path=/trunk/; revision=54078
2013-12-10 | use large positive values for illegal DVB-SI string encodings | Martin Kaiser | 1 | -4/+4
interpret encoding fields as UINT32 so that the displayed value matches the actual bytes in the packet svn path=/trunk/; revision=53927
2013-12-09 | add editor modelines | Martin Kaiser | 1 | -0/+13
svn path=/trunk/; revision=53890
2013-12-09 | From Jakub | Martin Kaiser | 1 | -0/+52
From Jakub: support DVB-SI character tables (EN 300 468) in a generic way.
From me: move things to charsets.c/.h; distinguish between single and multi byte encoding for some tables (so that the highlighted bytes match the displayed value); no character table byte -> length 0, use default table.
svn path=/trunk/; revision=53886
2013-12-07 | Note what the two new character encoding tables in charsets.c are. | Guy Harris | 1 | -0/+6
svn path=/trunk/; revision=53833
2013-12-07 | Add string encoding for ISO/IEC 8859-2 (ENC_ISO_8859_2) | Jakub Zawadzki | 1 | -0/+3
svn path=/trunk/; revision=53826
2013-12-07 | Add new string proto encoding for windows-1250 (ENC_WINDOWS_1250) | Jakub Zawadzki | 1 | -0/+3
- Move the windows-1250 to unicode encoding table to charsets.c
- Add tvb_get_string_unichar2, tvb_get_stringz_unichar2 functions which recode a tvb string to UTF-8.
svn path=/trunk/; revision=53819
2013-03-01 | Export libwireshark symbols using WS_DLL_PUBLIC define | Balint Reczey | 1 | -0/+4
Also remove old WS_VAR_IMPORT define and related Makefile magic everywhere in the project. svn path=/trunk/; revision=47992
2012-06-28 | Update Free Software Foundation address. | Jakub Zawadzki | 1 | -1/+1
(COPYING will be updated in next commit) svn path=/trunk/; revision=43536
2006-05-21 | name change | Ronnie Sahlberg | 1 | -2/+2
svn path=/trunk/; revision=18197
2004-09-10 | Move the stuff to handle ASCII <-> EBCDIC conversions to "epan/charsets.c"; other character set translation code should perhaps go there as well. | Guy Harris | 1 | -0/+42
svn path=/trunk/; revision=11958