text2pcap: add regex

Add support in text2pcap for the regex mode added to "Import from
Hex Dump" in 3.6.0. The input and output indicators cannot (yet?)
be configured, and are set to the default of allowing any of "iI<"
for inbound and "oO>" for outbound. This reaches feature parity
between text2pcap and Import from Hex Dump, and fixes #16724.
(There might be some more cleanups to do, including docs.)
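
The default indicator sets described above can be illustrated with a small
standalone sketch. It is not code from this commit: the enum, the function
name, and the handling of an empty field are assumptions made for
illustration, and only the "iI<" and "oO>" strings come from the change
itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not taken from the Wireshark sources. */
typedef enum { DIR_UNKNOWN, DIR_INBOUND, DIR_OUTBOUND } pkt_dir_t;

/* Default indicator sets named in the commit message. */
static const char in_indication[]  = "iI<";
static const char out_indication[] = "oO>";

/* Classify a packet by the first character of its captured "dir" field. */
static pkt_dir_t classify_dir(const char *dir_field)
{
    assert(dir_field != NULL);

    char c = dir_field[0];
    if (c == '\0') {
        /* strchr() would match the terminating NUL, so reject it here. */
        return DIR_UNKNOWN;
    }
    if (strchr(in_indication, c) != NULL) {
        return DIR_INBOUND;
    }
    if (strchr(out_indication, c) != NULL) {
        return DIR_OUTBOUND;
    }
    return DIR_UNKNOWN;
}
```

With these sets, classify_dir("<") and classify_dir("In") would both yield
DIR_INBOUND, while a field starting with any other character is left
unclassified.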
John Thacker 2021-12-30 18:11:15 -05:00
parent 6cdb86fbc7
commit ab347ea14e
2 changed files with 192 additions and 28 deletions


@ -14,6 +14,7 @@ text2pcap - Generate a capture file from an ASCII hexdump of packets
[manarg]
*text2pcap*
[ *-a* ]
[ *-b* 2|8|16|64 ]
[ *-D* ]
[ *-e* <l3pid> ]
[ *-h* ]
@ -24,6 +25,7 @@ text2pcap - Generate a capture file from an ASCII hexdump of packets
[ *-m* <max-packet> ]
[ *-o* hex|oct|dec|none ]
[ *-q* ]
[ *-r* <regex> ]
[ *-s* <srcport>,<destport>,<tag> ]
[ *-S* <srcport>,<destport>,<ppi> ]
[ *-t* <timefmt> ]
@ -97,12 +99,57 @@ future, these may be used to give more fine grained control on the
dump and the way it should be processed e.g. timestamps, encapsulation
type etc.
*Text2pcap* is also capable of scanning a text input file using a custom Perl
compatible regular expression that matches a single packet. *text2pcap*
searches the given file (which must end with '\n') for non-overlapping non-empty
strings matching the regex. Named capturing subgroups, which must match
exactly once per packet, are used to identify fields to import. The following
fields are supported in regex mode, one mandatory and three optional:
"data" Actual captured frame data to import
"time" Timestamp of packet
"dir" Direction of packet
"seqno" Arbitrary ID of packet
The 'data' field is the captured data, which must be in a selected encoding:
hexadecimal (the default), octal, binary, or base64 and containing no
characters in the data field outside the encoding set besides whitespace.
The 'time' field is parsed according to the format in the *-t* parameter.
The first character of the 'dir' field is compared against a set of characters
corresponding to inbound and outbound that default to "iI<" for inbound and
"oO>" for outbound to assign a direction. The 'seqno' field is assumed to
be a positive integer base 10 used for an arbitrary ID. An optional field's
information will only be written if the field is present in the regex and if
the capture file format supports it. (E.g., the pcapng format supports all
three fields, but the pcap format only supports timestamps.)
Here is a sample dump that the regex mode can process with the regex
'^(?<dir>[<>])\s(?<time>\d+:\d\d:\d\d.\d+)\s(?<data>[0-9a-fA-F]+)$' along
with timestamp format '%H:%M:%S.%f', directional indications of '<' and '>',
and hex encoding:
> 0:00:00.265620 a130368b000000080060
> 0:00:00.280836 a1216c8b00000000000089086b0b82020407
< 0:00:00.295459 a2010800000000000000000800000000
> 0:00:00.296982 a1303c8b00000008007088286b0bc1ffcbf0f9ff
> 0:00:00.305644 a121718b0000000000008ba86a0b8008
< 0:00:00.319061 a2010900000000000000001000600000
> 0:00:00.330937 a130428b00000008007589186b0bb9ffd9f0fdfa3eb4295e99f3aaffd2f005
> 0:00:00.356037 a121788b0000000000008a18
The regex is compiled with multiline support, and it is recommended to use
the anchors '^' and '$' for best results.
*Text2pcap* also allows the user to read in dumps of
application-level data, by inserting dummy L2, L3 and L4 headers
before each packet. The user can elect to insert Ethernet headers,
Ethernet and IP, or Ethernet, IP and UDP/TCP/SCTP headers before each
packet. This allows Wireshark or any other full-packet decoder to
handle these dumps.
handle these dumps. These encapsulation options can be used in both
hexdump mode and regex mode.
When <__infile__> or <__outfile__> are '-', standard input or standard
output, respectively, are used.
== OPTIONS
@ -111,16 +158,28 @@ handle these dumps.
--
Enables ASCII text dump identification. It allows one to identify the start of
the ASCII text dump and not include it in the packet even if it looks like HEX.
This parameter has no effect in regex mode.
*NOTE:* Do not enable it if the input file does not contain the ASCII text dump.
--
-b 2|8|16|64::
+
--
Specify the base (radix) of the encoding of the packet data in regex mode.
The supported options are 2 (binary), 8 (octal), 16 (hexadecimal), and 64
(base64 encoding), with hex as the default. This parameter has no effect
in hexdump mode.
--
-D::
+
--
The text before the packet may start either with an I or O indicating that
the packet is inbound or outbound. This is used when generating dummy headers.
The indication is only stored if the output format is pcapng.
The indication is only stored if the output format supports it (e.g. pcapng.)
This parameter has no effect in regex mode, where the presence of the `<dir>`
capturing group determines whether direction indicators are expected.
--
-e <l3pid>::
@ -191,14 +250,15 @@ Write the file in pcapng format rather than pcap format.
+
--
Specify a name for the interface included when writing a pcapng format
file. By default no name is defined.
file.
--
-o hex|oct|dec|none::
+
--
Specify the radix for the offsets (hex, octal, decimal, or none). Defaults to
hex. This corresponds to the `-A` option for __od__.
hex. This corresponds to the `-A` option for __od__. This parameter has no
effect in regex mode.
*NOTE:* With __-o none__, only one packet will be created, ignoring any
direction indicators or timestamps after the first byte along with any offsets.
@ -223,6 +283,15 @@ Don't display the summary of the options selected at the beginning,
or the count of packets processed at the end.
--
-r <regex>::
+
--
Process the file in regex mode using __regex__ as described above.
*NOTE:* The regex mode uses memory-mapped I/O and does not work on
streams that do not support seeking, like terminals and pipes.
--
-s <srcport>,<destport>,<tag>::
+
--
@ -252,10 +321,11 @@ into the SCTP header.
--
Treats the text before the packet as a date/time code; __timefmt__ is a
format string supported by strftime(3), supplemented with the field
descriptor "%f" for fractional seconds up to nanoseconds.
descriptor '%f' for fractional seconds up to nanoseconds.
Example: The time "10:15:14.5476" has the format code "%H:%M:%S.%f"
The special format string __ISO__ indicates that the string should be
parsed according to the ISO-8601 specification.
parsed according to the ISO-8601 specification. This parameter is used
in regex mode if and only if the `<time>` capturing group is present.
*NOTE:* Date/time fields from the current date/time are
used as the default for unspecified fields.
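
The -b option documented above accepts only the radix values 2, 8, 16, and 64.
A minimal standalone sketch of that validation follows, assuming plain
strtoul() in place of the ws_strtou8() helper that the commit uses. The enum
and function names are hypothetical, and strtoul() is more lenient than a
strict parser (it accepts leading whitespace and a '+' sign).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the encoding constants used by the importer. */
typedef enum { ENC_INVALID, ENC_BIN, ENC_OCT, ENC_HEX, ENC_BASE64 } data_encoding_t;

/* Map a "-b" argument string to an encoding, or ENC_INVALID if unsupported. */
static data_encoding_t encoding_from_radix_arg(const char *arg)
{
    char *end = NULL;
    unsigned long radix;

    assert(arg != NULL);
    if (*arg == '\0') {
        return ENC_INVALID;
    }

    errno = 0;
    radix = strtoul(arg, &end, 10);
    if (errno != 0 || *end != '\0') {
        /* Out of range, not a number, or trailing garbage after the number. */
        return ENC_INVALID;
    }

    switch (radix) {
    case 2:  return ENC_BIN;
    case 8:  return ENC_OCT;
    case 16: return ENC_HEX;
    case 64: return ENC_BASE64;
    default: return ENC_INVALID;
    }
}
```

A caller would treat ENC_INVALID the way the option parser treats a bad
argument: report the error and print the usage text.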


@ -90,7 +90,6 @@
#include <wsutil/ws_getopt.h>
#include <errno.h>
#include <assert.h>
#include "text2pcap.h"
@ -150,15 +149,12 @@ static guint32 hdr_data_chunk_ppid = 0;
/* Export PDU */
static gboolean hdr_export_pdu = FALSE;
static gboolean has_direction = FALSE;
/*--- Local data -----------------------------------------------------------------*/
/* This is where we store the packet currently being built */
static guint32 max_offset = WTAP_MAX_PACKET_SIZE_STANDARD;
/* Time code of packet, derived from packet_preamble */
static char *ts_fmt = NULL;
static int ts_fmt_iso = 0;
/* Input file */
@ -198,14 +194,25 @@ print_usage (FILE *output)
" used as the default for unspecified fields.\n"
" -D the text before the packet starts with an I or an O,\n"
" indicating that the packet is inbound or outbound.\n"
" This is used when generating dummy headers.\n"
" The indication is only stored if the output format is pcapng.\n"
" This is used when generating dummy headers if the\n"
" output format supports it (e.g. pcapng).\n"
" -a enable ASCII text dump identification.\n"
" The start of the ASCII text dump can be identified\n"
" and excluded from the packet data, even if it looks\n"
" like a HEX dump.\n"
" NOTE: Do not enable it if the input file does not\n"
" contain the ASCII text dump.\n"
" -r <regex> enable regex mode. Scan the input using <regex>, a Perl\n"
" compatible regular expression matching a single packet.\n"
" Named capturing subgroups are used to identify fields:\n"
" <data> (mand.), and <time>, <dir>, and <seqno> (opt.)\n"
" The time field format is taken from the -t option\n"
" Example: -r '^(?<dir>[<>])\\s(?<time>\\d+:\\d\\d:\\d\\d.\\d+)\\s(?<data>[0-9a-fA-F]+)$'\n"
" could match a file with lines like\n"
" > 0:00:00.265620 a130368b000000080060\n"
" < 0:00:00.295459 a2010800000000000000000800000000\n"
" -b 2|8|16|64 encoding base (radix) of the packet data in regex mode\n"
" (def: 16: hexadecimal) No effect in hexdump mode.\n"
"\n"
"Output:\n"
" -l <typenum> link-layer type number; default is 1 (Ethernet). See\n"
@ -307,27 +314,53 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
int file_type_subtype;
int err;
char* err_info;
GError* gerror = NULL;
GRegex* regex = NULL;
info->mode = TEXT_IMPORT_HEXDUMP;
info->hexdump.offset_type = OFFSET_HEX;
info->regex.encoding = ENCODING_PLAIN_HEX;
info->payload = "data";
/* Initialize the version information. */
ws_init_version_info("Text2pcap (Wireshark)", NULL, NULL, NULL);
/* Scan CLI parameters */
while ((c = ws_getopt_long(argc, argv, "aDhqe:i:l:m:nN:o:u:P:s:S:t:T:v4:6:", long_options, NULL)) != -1) {
while ((c = ws_getopt_long(argc, argv, "hqab:De:i:l:m:nN:o:u:P:r:s:S:t:T:v4:6:", long_options, NULL)) != -1) {
switch (c) {
case 'h':
show_help_header("Generate a capture file from an ASCII hexdump of packets.");
print_usage(stdout);
exit(0);
break;
case 'D': has_direction = TRUE; break;
case 'q': quiet = TRUE; break;
case 'a': info->hexdump.identify_ascii = TRUE; break;
case 'D': info->hexdump.has_direction = TRUE; break;
case 'l': pcap_link_type = (guint32)strtol(ws_optarg, NULL, 0); break;
case 'm': max_offset = (guint32)strtol(ws_optarg, NULL, 0); break;
case 'n': use_pcapng = TRUE; break;
case 'N': interface_name = ws_optarg; break;
case 'b':
{
guint8 radix;
if (!ws_strtou8(ws_optarg, NULL, &radix)) {
cmdarg_err("Bad argument for '-b': %s", ws_optarg);
print_usage(stderr);
return INVALID_OPTION;
}
switch (radix) {
case 2: info->regex.encoding = ENCODING_PLAIN_BIN; break;
case 8: info->regex.encoding = ENCODING_PLAIN_OCT; break;
case 16: info->regex.encoding = ENCODING_PLAIN_HEX; break;
case 64: info->regex.encoding = ENCODING_BASE64; break;
default:
cmdarg_err("Bad argument for '-b': %s", ws_optarg);
print_usage(stderr);
return INVALID_OPTION;
}
break;
}
case 'o':
if (ws_optarg[0] != 'h' && ws_optarg[0] != 'o' && ws_optarg[0] != 'd' && ws_optarg[0] != 'n') {
cmdarg_err("Bad argument for '-o': %s", ws_optarg);
@ -341,6 +374,7 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
case 'n': info->hexdump.offset_type = OFFSET_NONE; break;
}
break;
case 'e':
hdr_ethernet = TRUE;
if (sscanf(ws_optarg, "%x", &hdr_ethernet_proto) < 1) {
@ -368,6 +402,28 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
info->payload = ws_optarg;
break;
case 'r':
info->mode = TEXT_IMPORT_REGEX;
if (regex != NULL) {
/* XXX: Used the option twice. Should we warn? */
g_regex_unref(regex);
}
regex = g_regex_new(ws_optarg, G_REGEX_DUPNAMES | G_REGEX_OPTIMIZE | G_REGEX_MULTILINE, G_REGEX_MATCH_NOTEMPTY, &gerror);
if (gerror) {
cmdarg_err("%s", gerror->message);
g_error_free(gerror);
print_usage(stderr);
return INVALID_OPTION;
} else {
if (g_regex_get_string_number(regex, "data") == -1) {
cmdarg_err("Regex missing capturing group data (use (?<data>(...)) )");
g_regex_unref(regex);
print_usage(stderr);
return INVALID_OPTION;
}
}
break;
case 's':
hdr_sctp = TRUE;
hdr_data_chunk = FALSE;
@ -408,6 +464,7 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
set_hdr_ip_proto(132);
break;
case 'S':
hdr_sctp = TRUE;
hdr_data_chunk = TRUE;
@ -450,7 +507,7 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
break;
case 't':
ts_fmt = ws_optarg;
info->timestamp_format = ws_optarg;
if (!strcmp(ws_optarg, "ISO"))
ts_fmt_iso = 1;
break;
@ -509,10 +566,6 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
set_hdr_ip_proto(6);
break;
case 'a':
info->hexdump.identify_ascii = TRUE;
break;
case 'v':
show_version();
exit(0);
@ -598,6 +651,20 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
}
/* Some validation */
if (info->mode == TEXT_IMPORT_REGEX) {
info->regex.format = regex;
/* need option for data encoding */
if (g_regex_get_string_number(regex, "dir") > -1) {
/* XXX: Add parameter(s?) to specify these? */
info->regex.in_indication = "iI<";
info->regex.out_indication = "oO>";
}
if (g_regex_get_string_number(regex, "time") > -1 && info->timestamp_format == NULL) {
cmdarg_err("Regex with <time> capturing group requires time format (-t)");
return INVALID_OPTION;
}
}
if (pcap_link_type != 1 && hdr_ethernet) {
cmdarg_err("Dummy headers (-e, -i, -u, -s, -S -T) cannot be specified with link type override (-l)");
return INVALID_OPTION;
@ -640,12 +707,36 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
if (strcmp(argv[ws_optind], "-") != 0) {
input_filename = argv[ws_optind];
input_file = ws_fopen(input_filename, "rb");
if (!input_file) {
open_failure_message(input_filename, errno, FALSE);
return OPEN_ERROR;
if (info->mode == TEXT_IMPORT_REGEX) {
info->regex.import_text_GMappedFile = g_mapped_file_new(input_filename, TRUE, &gerror);
if (gerror) {
cmdarg_err("%s", gerror->message);
g_error_free(gerror);
return OPEN_ERROR;
}
} else {
input_file = ws_fopen(input_filename, "rb");
if (!input_file) {
open_failure_message(input_filename, errno, FALSE);
return OPEN_ERROR;
}
}
} else {
if (info->mode == TEXT_IMPORT_REGEX) {
/* text_import_regex requires a memory mapped file, so this likely
* won't work, unless the user has redirected a file (not a FIFO)
* to stdin, though that's pretty silly and unnecessary.
* XXX: We could read until EOF, write it to a temp file, and then
* mmap that (ugh)?
*/
info->regex.import_text_GMappedFile = g_mapped_file_new_from_fd(0, TRUE, &gerror);
if (gerror) {
cmdarg_err("%s", gerror->message);
cmdarg_err("regex import requires memory-mapped I/O and cannot be used with terminals or pipes");
g_error_free(gerror);
return INVALID_OPTION;
}
}
input_filename = "Standard input";
input_file = stdin;
}
@ -686,12 +777,9 @@ parse_options(int argc, char *argv[], text_import_info_t * const info, wtap_dump
return OPEN_ERROR;
}
info->mode = TEXT_IMPORT_HEXDUMP;
info->import_text_filename = input_filename;
info->output_filename = output_filename;
info->hexdump.import_text_FILE = input_file;
info->hexdump.has_direction = has_direction;
info->timestamp_format = ts_fmt;
info->encapsulation = wtap_encap_type;
info->wdh = wdh;
@ -825,8 +913,8 @@ main(int argc, char *argv[])
goto clean_exit;
}
assert(input_file != NULL);
assert(wdh != NULL);
ws_assert(input_file != NULL || info.regex.import_text_GMappedFile != NULL);
ws_assert(wdh != NULL);
ret = text_import(&info);
@ -843,6 +931,12 @@ clean_exit:
if (input_file) {
fclose(input_file);
}
if (info.regex.import_text_GMappedFile) {
g_mapped_file_unref(info.regex.import_text_GMappedFile);
}
if (info.regex.format) {
g_regex_unref(info.regex.format);
}
if (wdh) {
int err;
char *err_info;