Extends raw adressing syntax to wok with references. The syntax
is
@field1 == ${@field2}
This requires replicating the logic to load field references, but
using raw values instead. We use separate hash tables for that,
namely "references" vs "raw_references".
This adds new syntax to read a field from the tree as bytes, instead
of the actual type. This is a useful extension for example to match
matformed strings that contain unicode replacement characters. In
this case it is not possible to match the raw value of the malformed
string field. This extension fills this need and is generic enough
that it should be useful in many other situations.
The syntax used is to prefix the field name with "@". The following
artificial example tests if the HTTP user agent contains a particular
invalid UTF-8 sequence:
@http.user_agent == "Mozill\xAA"
Where simply using "http.user_agent" won't work because the invalid byte
sequence will have been replaced with U+FFFD.
Considering the following programs:
$ dftest '_ws.ftypes.string == "ABC"'
Filter: _ws.ftypes.string == "ABC"
Syntax tree:
0 TEST_ANY_EQ:
1 FIELD(_ws.ftypes.string <FT_STRING>)
1 FVALUE("ABC" <FT_STRING>)
Instructions:
00000 READ_TREE _ws.ftypes.string <FT_STRING> -> reg#0
00001 IF_FALSE_GOTO 3
00002 ANY_EQ reg#0 == "ABC" <FT_STRING>
00003 RETURN
$ dftest '@_ws.ftypes.string == "ABC"'
Filter: @_ws.ftypes.string == "ABC"
Syntax tree:
0 TEST_ANY_EQ:
1 FIELD(_ws.ftypes.string <RAW>)
1 FVALUE(41:42:43 <FT_BYTES>)
Instructions:
00000 READ_TREE @_ws.ftypes.string <FT_BYTES> -> reg#0
00001 IF_FALSE_GOTO 3
00002 ANY_EQ reg#0 == 41:42:43 <FT_BYTES>
00003 RETURN
In the second case the field has a "raw" type, that equates directly to
FT_BYTES, and the field value is read from the protocol raw data.
This removes unparsed name resolution during the semantic
check because it feels like a hack to work around limitations
in the language syntax, that should be solved at the lexical
level instead.
We were interpreting unparsed differently on the LHS and RHS.
Now an unparsed value is always a field if it matches a
registered field name (this matches the implementation in 3.6
and before).
This requires tightening a bit the allowed filter names for
protocols to avoid some common and potentially weird conflicting
cases.
Incidentally this extends set grammar to accept all entities.
That is experimental and may be reverted in the future.
This adds support for using the layers filter
with field references.
Before:
$ dftest 'ip.src != ${ip.src#2}'
dftest: invalid character in macro name
After:
$ dftest 'ip.src != ${ip.src#2}'
Filter: ip.src != ${ip.src#2}
Syntax tree:
0 TEST_ALL_NE:
1 FIELD(ip.src <FT_IPv4>)
1 REFERENCE(ip.src#[2:1] <FT_IPv4>)
Instructions:
00000 READ_TREE ip.src <FT_IPv4> -> reg#0
00001 IF_FALSE_GOTO 5
00002 READ_REFERENCE_R ${ip.src <FT_IPv4>} #[2:1] -> reg#1
00003 IF_FALSE_GOTO 5
00004 ALL_NE reg#0 != reg#1
00005 RETURN
This requires adding another level of complexity to references.
When loading references we need to copy the 'proto_layer_num'
and add the logic to filter on that.
The "layer" sttype is removed and replace by a new
field sttype with support for a range. This is a nice
cleanup for the semantic check and general simplification.
The grammar is better too with this design.
Range sttype is renamed to slice for clarity.
Add support to display filters for matching a specific layer within a frame.
Layers are counted sequentially up the protocol stack. Each protocol
(dissector) that appears in the stack is one layer.
LINK-LAYER#1 <-> IP#1 <-> TCP#1 <-> IP#2 <-> TCP#2 <-> etc.
The syntax allows for negative indexes and ranges with the usual semantics
for slices (but note that counting starts at one):
tcp.port#[2-4] == 1024
Matches layers 2 to 4 inclusive.
Fixes#3791.
The word range is used for different things with different
meanings and that is confusing. Avoid using "range" in code to
mean "slice".
A range is one or more intervals with a lower and upper bound.
A slice is a range applied to a bytes field.
Replace range with slice wherever appropriate. This usage of
"slice" instead of range is generally correct and consistent in
the documentation.
This allows writing moderately complex expressions, for example
a float epsilon test (#16483):
Filter: {abs(_ws.ftypes.double - 1) / max(abs(_ws.ftypes.double), abs(1))} < 0.01
Syntax tree:
0 TEST_LT:
1 OP_DIVIDE:
2 FUNCTION(abs#1):
3 OP_SUBTRACT:
4 FIELD(_ws.ftypes.double)
4 FVALUE(1 <FT_DOUBLE>)
2 FUNCTION(max#2):
3 FUNCTION(abs#1):
4 FIELD(_ws.ftypes.double)
3 FUNCTION(abs#1):
4 FVALUE(1 <FT_DOUBLE>)
1 FVALUE(0.01 <FT_DOUBLE>)
Instructions:
00000 READ_TREE _ws.ftypes.double -> reg#1
00001 IF_FALSE_GOTO 3
00002 SUBRACT reg#1 - 1 <FT_DOUBLE> -> reg#2
00003 STACK_PUSH reg#2
00004 CALL_FUNCTION abs(reg#2) -> reg#0
00005 STACK_POP 1
00006 IF_FALSE_GOTO 24
00007 READ_TREE _ws.ftypes.double -> reg#1
00008 IF_FALSE_GOTO 9
00009 STACK_PUSH reg#1
00010 CALL_FUNCTION abs(reg#1) -> reg#4
00011 STACK_POP 1
00012 IF_FALSE_GOTO 13
00013 STACK_PUSH reg#4
00014 STACK_PUSH 1 <FT_DOUBLE>
00015 CALL_FUNCTION abs(1 <FT_DOUBLE>) -> reg#5
00016 STACK_POP 1
00017 IF_FALSE_GOTO 18
00018 STACK_PUSH reg#5
00019 CALL_FUNCTION max(reg#5, reg#4) -> reg#3
00020 STACK_POP 2
00021 IF_FALSE_GOTO 24
00022 DIVIDE reg#0 / reg#3 -> reg#6
00023 ANY_LT reg#6 < 0.01 <FT_DOUBLE>
00024 RETURN
We now use a stack to pass arguments to the function. The
stack is implemented as a list of lists (list of registers).
Arguments may still be non-existent to functions (this is
a feature). Functions must check for nil arguments (NULL lists)
and handle that case.
It's somewhat complicated to allow literal values and test compatibility
for different types, both because of lack of type information with
unparsed/literal and also because it is an underdeveloped area in the
code. In my limited testing it was good enough and useful, further
enhancements are left for future work.
Add location tracking as a column offset and length from offset
to the scanner. Our input is a single line only so we don't need
to track line offset.
Record that information in the syntax tree. Return the error location
in dfilter_compile(). Use it in dftest to mark the location of the
error in the filter string. Later it would be nice to use the location
in the GUI as well.
$ dftest "ip.proto == aaaaaa and tcp.port == 123"
Filter: ip.proto == aaaaaa and tcp.port == 123
dftest: "aaaaaa" cannot be found among the possible values for ip.proto.
ip.proto == aaaaaa and tcp.port == 123
^~~~~~
Revert to passing a syntax node from the lexical scanner to the grammar
parser. Using a union is not having a discernible advantage and requires
duplicating a lot of properties of syntax nodes.
This removes the limitation of having only two terms in an
arithmetic expression and allows setting the precedence using
curly braces (like any basic calculator).
Our grammar currently does not allow grouping arithmetic expressions
using parenthesis, because boolean expressions and arithmetic
expressions are different and parenthesis are used with the former.
After some experimentation I don't think these two existence tests
belong in the grammar, it's an implementation detail and removing it
might avoid some artificial constraints.
In most, if not all, programming languages logical AND has
higher precedence than logical OR. Apply the principle of
least surprise and do the same for Wireshark display
filters.
Before: ip and tcp or udp => ip and (tcp or udp)
Filter: ip and tcp or udp
Instructions:
00000 CHECK_EXISTS ip
00001 IF_FALSE_GOTO 5
00002 CHECK_EXISTS tcp
00003 IF_TRUE_GOTO 5
00004 CHECK_EXISTS udp
00005 RETURN
After: ip and tcp or udp => (ip and tcp) or udp
Filter: ip and tcp or udp
Instructions:
00000 CHECK_EXISTS ip
00001 IF_FALSE_GOTO 4
00002 CHECK_EXISTS tcp
00003 IF_TRUE_GOTO 5
00004 CHECK_EXISTS udp
00005 RETURN
Add support for display filter binary addition and subtraction.
The grammar is intentionally kept simple for now. The use case
is to add a constant to a protocol field, or (maybe) add two
fields in an expression.
We use signed arithmetic with unsigned numbers, checking for
overflow and casting where necessary to do the conversion.
We could legitimately opt to use traditional modular arithmetic
instead (like C) and if it turns out that that is more useful for
some reason we may want to in the future.
Fixes#15504.
This replaces the current macro reference system with
a completely different implementation. Instead of a macro a reference
is a syntax element. A reference is a constant that can be filled
in the dfilter code after compilation from an existing protocol tree.
It is best understood as a field value that can be read from a fixed
tree that is not the frame being filtered. Usually this fixed tree
is the currently selected frame when the filter is applied. This
allows comparing fields in the filtered frame with fields in the
selected frame.
Because the field reference syntax uses the same sigil notation
as a macro we have to use a heuristic to distinguish them:
if the name has a dot it is a field reference, otherwise
it is a macro name.
The reference is synctatically validated at compile time.
There are two main advantages to this implementation (and a couple of
minor ones):
The protocol tree for each selected frame is only walked if we have a
display filter and if the display filter uses references. Also only the
actual reference values are copied, intead of loading the entire tree
into a hash table (in textual form even).
The other advantage is that the reference is tested like a protocol
field against all the values in the selected frame (if there is more
than one).
Currently the reference fields are not "primed" during dissection, so
the entire tree is walked to find a particular reference (this is
similar to the previous implementation).
If the display filter contains a valid reference and the reference is
not loaded at the time the filter is run the result is the same as a
non existing field for a regular READ_TREE instruction.
Fixes#17599.
This usage devalues a mechanism for warning users that deserves more
attention than this minor suggestion.
The warning is inconvenient for intermediate and advanced users.
This change implements a unary minus operator.
Filter: tcp.window_size_scalefactor == -tcp.dstport
Instructions:
00000 READ_TREE tcp.window_size_scalefactor -> reg#0
00001 IF_FALSE_GOTO 6
00002 READ_TREE tcp.dstport -> reg#1
00003 IF_FALSE_GOTO 6
00004 MK_MINUS -reg#1 -> reg#2
00005 ANY_EQ reg#0 == reg#2
00006 RETURN
It is supported for integer types, floats and relative time values.
The unsigned integer types are promoted to a 32 bit signed integer.
Unary plus is implemented as a no-op. The plus sign is simply ignored.
Constant arithmetic expressions are computed during compilation.
Overflow with constants is a compile time error. Overflow with
variables is a run time error and silently ignored. Only a debug
message will be printed to the console.
Related to #15504.
Add support for masking of bits. Before the bitwise operator
could only test bits, it did not support clearing bits.
This allows testing if any combination of bits are set/unset
more naturally with a single test. Previously this was only
possible by combining several bitwise predicates.
Bitwise is implemented as a test node, even though it is not.
Maybe the test node should be renamed to something else.
Fixes#17246.
For unparsed values on the RHS of a comparison try
to parse them first as a literal and only then as
a protocol. This is more complicated in code but
should be a use case a lot more common and useful in
practice.
It removes some annoying special cases and applies this
rule consistently to any expression. Consistency is
important otherwise the special cases and exceptions
make the language confusing and difficult to learn.
For values on the LHS the rule remains to first try a
protocol value, then a literal.
Related with issue #17731.
A literal value is a value that cannot be interpreted as a
registered protocol. An unparsed value can be a literal or
an identifier (protocol/field) according to context and the
current disambiguation rules.
Strictly literal here is to be understood to mean "numeric
literal, including numeric arrays, but not strings or character
constants".
The syntax for protocols and some literals like numbers
and bytes/addresses can be ambiguous. Some protocols can
be parsed as a literal, for example the protocol "fc"
(Fibre Channel) can be parsed as 0xFC.
If a numeric protocol is registered that will also take
precedence over any literal, according to the current
rules, thereby breaking numerical comparisons to that
number. The same for an hypothetical protocol named "true",
etc.
To allow the user to disambiguate this meaning introduce
new syntax.
Any value prefixed with ':' or enclosed in <,> will be treated
as a literal value only. The value :fc or <fc> will always
mean 0xFC, under any context. Never a protocol whose filter
name is "fc".
Likewise any value prefixed with a dot will always be parsed
as an identifier (protocol or protocol field) in the language.
Never any literal value parsed from the token "fc".
This allows the user to be explicit about the meaning,
and between the two explicit methods plus the ambiguous one
it doesn't completely break any one meaning.
The difference can be seen in the following two programs:
Filter: frame == fc
Constants:
Instructions:
00000 READ_TREE frame -> reg#0
00001 IF-FALSE-GOTO 5
00002 READ_TREE fc -> reg#1
00003 IF-FALSE-GOTO 5
00004 ANY_EQ reg#0 == reg#1
00005 RETURN
--------
Filter: frame == :fc
Constants:
00000 PUT_FVALUE fc <FT_PROTOCOL> -> reg#1
Instructions:
00000 READ_TREE frame -> reg#0
00001 IF-FALSE-GOTO 3
00002 ANY_EQ reg#0 == reg#1
00003 RETURN
The filter "frame == fc" is the same as "filter == .fc",
according to the current heuristic, except the first form
will try to parse it as a literal if the name does not
correspond to any registered protocol.
By treating a leading dot as a name in the language we
necessarily disallow writing floats with a leading dot. We
will also disallow writing with an ending dot when using
unparsed values. This is a backward incompatibility but has
the happy side effect of making the expression {1...2}
unambiguous.
This could either mean "1 .. .2" or "1. .. 2". If we require
a leading and ending digit then the meaning is clear:
1.0..0.2 -> 1.0 .. 0.2
Fixes#17731.
To complete the set of equality operators add an "all equal"
operator that matches a frame if all fields match the condition.
The symbol chosen for "all_eq" is "===".
Use that for error messages, including any using test operators.
This allows to always use the same name as the user. It avoids
cases where the user write "a && b" and the message is "a and b"
is syntactically invalid.
It should also allow us to be more consistent with the use of
double quotes.
Instead of requiring a special error function in the parser
just set the syntax_error flag if an error occurs, in any stage
of compilation. Outside of the parser loop it will not be used
but that is fine.
Invalid character constants should be handled in the lexical scanner.
Todo: See if some code could be shared to parse double quoted strings.
It also fixes some unintuitive type coercions to string. Character
constants should be treated as characters, or maybe integers, or
maybe even throw an invalid comparison error, but coverting to a
literal string or byte array is surprising and not particularly
useful:
'\xFF' -> "'\xFF'" (equals)
'\xFF' -> "FF" (contains)
Before:
Filter: http.request.method contains "\x63"
Constants:
00000 PUT_FVALUE "c" <FT_STRING> -> reg#1
(...)
Filter: http.request.method contains '\x63'
Constants:
00000 PUT_FVALUE "63" <FT_STRING> -> reg#1
(...)
Filter: http.request.method == "\x63"
Constants:
00000 PUT_FVALUE "c" <FT_STRING> -> reg#1
(...)
Filter: http.request.method == '\x63'
Constants:
00000 PUT_FVALUE "'\\x63'" <FT_STRING> -> reg#1
(...)
After:
Filter: http.request.method contains '\x63'
Constants:
00000 PUT_FVALUE "c" <FT_STRING> -> reg#1
(...)
Filter: http.request.method == '\x63'
Constants:
00000 PUT_FVALUE "c" <FT_STRING> -> reg#1
(...)
This reverts commit d635ff4933.
A charconst cannot be a value string, for that reason it is not
redundant with unparsed.
Maybe character constants should be parsed in the lexical scanner
instead.
Before:
Filter: ip.proto == '\g'
dftest: "'\g'" cannot be found among the possible values for ip.proto.
After:
Filter: ip.proto == '\g'
dftest: "'\g'" isn't a valid character constant.
Previously a chained expression like "a == b matches c" would be
considered syntactically valid. Fix that to be a syntax error.
Only math-like comparisons can be chained.
This also disallows chaining contains expressions.
A charconst uses the same semantic rules as unparsed so just
use the latter to avoid redundancies.
We keep the use of TOKEN_CHARCONST as an optimization to avoid
an unnecessary name resolution (lookup for a registered field with
the same name as the charconst).
Deprecate the usage of significant whitespace to separate set elements
(or anywhere else for that matter). This will make the implementation
simpler and cleaner and the language more expressive and user-friendly.
Currently unused. This might still be useful to differentiate
different spelling of the same token in user messages, like
"==" and "eq", but currently we are not storing test tokens
anyway, so just remove it, it makes everything simpler.
If it's ever necessary it can be added back.
Using a hand written tokenizer is simpler than using flex start
conditions. Do the validation in the drange node constructor.
Add validation for malformed ranges with different endpoint signs.