aboutsummaryrefslogtreecommitdiffstats
path: root/doc/README.display_filter
blob: db3ee6eb51266fd15dde2584a231ca36418f6617 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
(This is a consolidation of documentation written by stig, sahlberg, and gram)

What is the display filter system?
==================================
The display filter system allows the user to select packets by testing
for values in the proto_tree that Wireshark constructs for that packet.
Every proto_item in the proto_tree has an 'abbrev' field
and a 'type' field, which tells the display filter engine the name
of the field and its type (what values it can hold).

For example, this is the definition of the ip.proto field from packet-ip.c:

{ &hf_ip_proto,
      { "Protocol", "ip.proto", FT_UINT8, BASE_DEC | BASE_EXT_STRING,
              &ipproto_val_ext, 0x0, NULL, HFILL }},

This definition says that "ip.proto" is the display-filter name for
this field, and that its field-type is FT_UINT8.

The display filter system has 3 major parts to it:

    1. A type system (field types, or "ftypes")
    2. A parser, to convert a user's query to an internal representation
    3. An engine that uses the internal representation to select packets.


code:
epan/dfilter/* - the display filter engine, including
		scanner, parser, syntax-tree semantics checker, DFVM bytecode
		generator, and DFVM engine.
epan/ftypes/* - the definitions of the various FT_* field types.
epan/proto.c   - proto_tree-related routines


The field type system
=====================
The field type system is stored in epan/ftypes.

The proto_tree system #includes ftypes.h, which gives it the ftenum
definition, which is the enum of all possible ftypes:

/* field types */
enum ftenum {
    FT_NONE,        /* used for text labels with no value */
    FT_PROTOCOL,
    FT_BOOLEAN,     /* TRUE and FALSE come from <glib.h> */
    FT_UINT8,
    FT_UINT16,
    FT_UINT24,      /* really a UINT32, but displayed as3 hex-digits if FD_HEX*/
    FT_UINT32,
    FT_UINT64,
    etc., etc.
}

It also provides the definition of fvalue_t, the struct that holds the *value*
that corresponds to the type. Each proto_item (proto_node) holds an fvalue_t
due to having a field_info struct (defined in proto.h).

The fvalue_t is mostly just a gigantic union of possible C-language types
(as opposed to FT_* types):

typedef struct _fvalue_t {
        ftype_t *ftype;
        union {
                /* Put a few basic types in here */
                guint32         uinteger;
                gint32          sinteger;
                guint64         integer64;
                gdouble         floating;
                gchar           *string;
                guchar          *ustring;
                GByteArray      *bytes;
                ipv4_addr       ipv4;
                ipv6_addr       ipv6;
                e_guid_t        guid;
                nstime_t        time;
                tvbuff_t        *tvb;
                GRegex          *re;
        } value;

        /* The following is provided for private use
         * by the fvalue. */
        gboolean        fvalue_gboolean1;

} fvalue_t;


Defining a field type
---------------------
The ftype system itself is designed to be modular, so that new field types
can be added when necessary.

Each field type must implement an ftype_t structure, also defined in
ftypes.h. This is the way a field type is registered with the ftype engine.

If you take a look at ftype-integer.c, you will see that it provides
an ftype_register_integers() function, that fills in many such ftype_t
structs. It creates one for each integer type: FT_UINT8, FT_UINT16,
FT_UINT32, etc.

The ftype_t struct defines the things needed for the ftype:

    * its ftenum value
    * a string representation of the FT name ("FT_UINT8")
    * how much data it consumes in the packet
    * how to store that value in an fvalue_t: new(), free(),
        various value-related functions
    * how to compare that value against another
    * how to slice that value (strings and byte ranges can be sliced)

Using an fvalue_t
-----------------
Once the value of a field is stored in an fvalue_t (stored in
each proto_item via field_info), it's easy to use those values,
thanks to the various fvalue_*() functions defined in ftypes.h.

Functions like fvalue_get(), fvalue_eq(), etc., are all generic
interfaces to get information about the field's value. They work
on any field type because of the ftype_t struct, which is the lookup
table that the field-type engine uses to work with any field type.

The display filter parser
=========================
The display filter parser (along with the comparison engine)
is stored in epan/dfilter.

The scanner/parser pair read the string representing the display filter
and convert it into a very simple syntax tree.  The syntax tree is very
simple in that it is possible that many of the nodes contain unparsed
chunks of text from the display filter.

There are four phases to parsing a user's request:

 1. Scanning the string for dfilter syntax
 2. Parsing the keywords according to the dfilter grammar, into a
        syntax tree
 3. Doing a semantic check of the nodes in that syntax tree
 4. Converting the syntax tree into a series of DFVM byte codes

The dfilter_compile() function, in epan/dfilter/dfilter.c,
runs these 4 phases. The end result is a dfwork_t object (dfw), that
can be passed to dfilter_apply() to actually run the display filter
against a set of proto_trees.


Scanning the display filter string
----------------------------------
epan/dfilter/scanner.l is the lex scanner for finding keywords
in the user's display filter string.

Its operation is simple. It finds the special function and comparison
operators ("==", "!=", "eq", "ne", etc.), it finds slice operations
( "[0:1]" ), quoted strings, IP addresses, numbers, and any other "special"
keywords or string types.

Anything it doesn't know how to handle is passed to to grammar parser
as an unparsed string (TOKEN_UNPARSED). This includes field names. The
scanner does not interpret any protocol field names at all.

The scanner has to return a token type (TOKEN_*, and in many cases,
a value. The value will be an stnode_t struct, which is a syntax
tree node object. Since the final storage of the parse will
be in a syntax tree, it is convenient for the scanner to fill in
syntax tree nodes with values when it can.

The stnode_t definition is in epan/dfilter/syntax-tree.h


Parsing the keywords according to the dfilter grammar
-----------------------------------------------------
The grammar parser is implemented with the 'lemon' tool,
rather than the traditional yacc or bison grammar parser,
as lemon grammars were found to be easier to work with. The
lemon parser specification (epan/dfilter/grammar.lemon) is
much easier to read than its bison counterpart would be,
thanks to lemon's feature of being able to name fields, rather
then using numbers ($1, $2, etc.)

The lemon tool is located in tools/lemon in the Wireshark
distribution.

An on-line introduction to lemon is available at:

http://www.sqlite.org/src/doc/trunk/doc/lemon.html

The grammar specifies which type of constructs are possible
within the dfilter language ("dfilter-lang")

An "expression" in dfilter-lang can be a relational test or a logical test.

A relational test compares a value against another, which is usually
a field (or a slice of a field) against some static value, like:

    ip.proto == 1
    eth.dst != ff:ff:ff:ff:ff:ff

A logical test combines other expressions with "and", "or", and "not".

At the end of the grammatical parsing, the dfw object will
have a valid syntax tree, pointed at by dfw->st_root.

If there is an error in the syntax, the parser will call dfilter_fail()
with an appropriate error message, which the UI will need to report
to the user.

The syntax tree system
----------------------
The syntax tree is created as a result of running the lemon-based
grammar parser on the scanned tokens. The syntax tree code
is in epan/dfilter/syntax-tree* and epan/dfilter/sttree-*. It too
uses a set of code modules that implement different syntax node types,
similar to how the field-type system registers a set of ftypes
with a central engine.

Each node (stnode_t) in the syntax tree has a type (sttype).
These sttypes are very much related to ftypes (field types), but there
is not a one-to-one correspondence. The syntax tree nodes are slightly
high-level. For example, there is only a single INTEGER sttype, unlike
the ftype system that has a type for UINT64, UINT32, UINT16, UINT8, etc.

typedef enum {
        STTYPE_UNINITIALIZED,
        STTYPE_TEST,
        STTYPE_UNPARSED,
        STTYPE_STRING,
        STTYPE_FIELD,
        STTYPE_FVALUE,
        STTYPE_INTEGER,
        STTYPE_RANGE,
        STTYPE_FUNCTION,
        STTYPE_NUM_TYPES
} sttype_id_t;


The root node of the syntax tree is the main test or comparison
being done.

Semantic Check
--------------
After the parsing is done and a syntax tree is available, the
code in semcheck.c does a semantic check of what is in the syntax
tree.

The semantics of the simple syntax tree are checked to make sure that
the fields that are being compared are being compared to appropriate
values.  For example, if a field is an integer, it can't be compared to
a string, unless a value_string has been defined for that field.

During the process of checking the semantics, the simple syntax tree is
fleshed out and no longer contains nodes with unparsed information.  The
syntax tree is no longer in its simple form, but in its complete form.

For example, if the dfilter is slicing a field and comparing
against a set of bytes, semcheck.c has to check that the field
in question can indeed be sliced.

Or, can a field be compared against a certain type of value (string,
integer, float, IPv4 address, etc.)

The semcheck code also makes adjustments to the syntax tree
when it needs to. The parser sometimes stores raw, unparsed strings
in the syntax tree, and semcheck has to convert them to
certain types. For example, the display filter may contain
a value_string string (the "enum" type that protocols can use
to define the possible textual descriptions of numeric fields), and
semcheck will convert that value_string string into the correct
integer value.

Truth be told, the semcheck.c code is a bit disorganized, and could
be re-designed & re-written.

DFVM Byte Codes
---------------
The syntax tree is analyzed to create a sequence of bytecodes in the
"DFVM" language.  "DFVM" stands for Display Filter Virtual Machine.  The
DFVM is similar in spirit, but not in definition, to the BPF VM that
libpcap uses to analyze packets.

A virtual bytecode is created and used so that the actual process of
filtering packets will be fast.  That is, it should be faster to process
a list of VM bytecodes than to attempt to filter packets directly from
the syntax tree.  (heh...  no measurement has been made to support this
supposition)

The DFVM opcodes are defined in epan/dfilter/dfvm.h (dfvm_opcode_t).
Similar to how the BPF opcode system works in libpcap, there is a
limited set of opcodes. They operate by loading values from the
proto_tree into registers, loading pre-defined values into
registers, and comparing them. The opcodes are checked in sequence, and
there are only 2 branching opcodes: IF_TRUE_GOTO and IF_FALSE_GOTO.
Both of these can only branch forwards, and never backwards. In this way
sets of DFVM instructions will never get into an infinite loop.

The epan/dfilter/gencode.c code converts the syntax tree
into a set of dvfm instructions.

The constants that are in the DFVM instructions (the constant
values that the user is checking against) are pre-loaded
into registers via the dvfm_init_const() call, and stored
in the dfilter_t structure for when the display filter is
actually applied.


DFVM Engine
===========
Once the DFVM bytecode has been produced, it's a simple matter of
running the DFVM engine against the proto_tree from the packet
dissection, using the DFVM bytecodes as instructions.  If the DFVM
bytecode is known before packet dissection occurs, the
proto_tree-related code can be "primed" to store away pointers to
field_info structures that are interesting to the display filter.  This
makes lookup of those field_info structures during the filtering process
faster.

The dfilter_apply() function runs a single pre-compiled
display filter against a single proto_tree function, and returns
TRUE or FALSE, meaning that the filter matched or not.

That function calls dfvm_apply(), which runs across the DFVM
instructions, loading protocol field values into DFVM registers
and doing the comparisons.

There is a top-level Makefile target called 'dftest' which
builds a 'dftest' executable that will print out the DFVM
bytecode for any display filter given on the command-line.
To build it, run:

$ make dftest

To use it, give it the display filter on the command-line:

$ ./dftest 'ip.addr == 127.0.0.1'
Filter: "ip.addr == 127.0.0.1"

Constants:
00000 PUT_FVALUE        127.0.0.1 <FT_IPv4> -> reg#1

Instructions:
00000 READ_TREE         ip.addr -> reg#0
00001 IF-FALSE-GOTO     3
00002 ANY_EQ            reg#0 == reg#1
00003 RETURN


The output shows the original display filter, then the opcodes
that put constant values into registers. The registers are
numbered, and are shown in the output as "reg#n", where 'n' is the
identifying number.

Then the instructions are shown. These are the instructions
which are run for each proto_tree.

This is what happens in this example:

00000 READ_TREE         ip.addr -> reg#0

Any ip.addr fields in the proto_tree are loaded into register 0. Yes,
multiple values can be loaded into a single register. As a result
of this READ_TREE, the accumulator will hold TRUE or FALSE, indicating
if any field's value was loaded, or not.

00001 IF-FALSE-GOTO     3

If the load failed because there were no ip.addr fields
in the proto_tree, then we jump to instruction 3.

00002 ANY_EQ            reg#0 == reg#1

This checks to see if any of the fields in register 0 
(which has the pre-loaded constant value of 127.0.0.1) are equal
to any of the fields in register 1 (which are all of the ip.addr
fields in the proto tree). The resulting value in the
accumulator will be TRUE if any of the fields match, or FALSE
if none match.

00003 RETURN

This returns the accumulator's value, either TRUE or FALSE.

In addition to dftest, there is also a tools/dfilter-test script
which is a unit-test script for the display filter engine.
It makes use of text2pcap and tshark to run specific display
filters against specific captures (embedded within dfilter-test).



Display Filter Functions
========================
You define a display filter function by adding an entry to
the df_functions table in epan/dfilter/dfunctions.c. The record struct
is defined in dfunctions.h, and shown here:

typedef struct {
    char            *name;
    DFFuncType      function;
    ftenum_t        retval_ftype;
    guint           min_nargs;
    guint           max_nargs;
    DFSemCheckType  semcheck_param_function;
} df_func_def_t;

name - the name of the function; this is how the user will call your
    function in the display filter language

function - this is the run-time processing of your function.

retval_ftype - what type of FT_* type does your function return?

min_nargs - minimum number of arguments your function accepts
max_nargs - maximum number of arguments your function accepts

semcheck_param_function - called during the semantic check of the
    display filter string.

DFFuncType function
-------------------
typedef gboolean (*DFFuncType)(GList *arg1list, GList *arg2list, GList **retval);

The return value of your function is a gboolean; TRUE if processing went fine,
or FALSE if there was some sort of exception.

For now, display filter functions can accept a maximum of 2 arguments.
The "arg1list" parameter is the GList for the first argument. The
'arg2list" parameter is the GList for the second argument. All arguments
to display filter functions are lists. This is because in the display
filter language a protocol field may have multiple instances. For example,
a field like "ip.addr" will exist more than once in a single frame. So
when the user invokes this display filter:

    somefunc(ip.addr) == TRUE

even though "ip.addr" is a single argument, the "somefunc" function will
receive a GList of *all* the values of "ip.addr" in the frame.

Similarly, the return value of the function needs to be a GList, since all
values in the display filter language are lists. The GList** retval argument
is passed to your function so you can set the pointer to your return value.

DFSemCheckType
--------------
typedef void (*DFSemCheckType)(int param_num, stnode_t *st_node);

For each parameter in the syntax tree, this function will be called.
"param_num" will indicate the number of the parameter, starting with 0.
The "stnode_t" is the syntax-tree node representing that parameter.
If everything is okay with the value of that stnode_t, your function
does nothing --- it merely returns. If something is wrong, however,
it should THROW a TypeError exception.


Example: add an 'in' display filter operation
=============================================

This example has been discussed on wireshark-dev in April 2004. It illustrates
how a more complex operation can be added to the display filter language.

Question:

	If I want to add an 'in' display filter operation, I need to define
	several things. This can happen in different ways. For instance,
	every value from the "in" value collection will result in a test.
	There are 2 options here, either a test for a single value:

		(x in {a b c})

	or a test for a value in a given range:

		(x in {a ... z})

	or even a combination of both. The former example can be reduced to:

		((x == a) or (x == b) or (x == c))

	while the latter can be reduced to

		((x >= MIN(a, z)) and (x <= MAX(a, z)))

	I understand that I can replace "x in {" with the following steps:
	first store x in the "in" test buffer, then add "(" to the display
	filter expression internally.

	Similarly I can replace the closing brace "}" with the following
	steps: release x from the "in" test buffer and then add ")"
	to the display filter expression internally.

	How could I do this?

Answer:

	This could be done in grammar.lemon. The grammar would produce
	syntax tree nodes, combining them with "or", when it is given
	tokens that represent the "in" syntax.

	It could also be done later in the process, maybe in
	semcheck.c. But if you can do it earlier, in grammar.lemon,
	then you shouldn't have to worry about modifying anything in
	semcheck.c, as the syntax tree that is passed to semcheck.c
	won't contain any new type of operators... just lots of nodes
	combined with "or".

How to add an operator FOO to the display filter language?
==========================================================

Go to wireshark/epan/dfilter/

Edit grammar.lemon and add the operator. Add the operator FOO and the
test logic (defining TEST_OP_FOO).

Edit scanner.l and add the operator name(s) hence defining
TOKEN_TEST_FOO. Also update the simple() or add the new operand's code.

Edit sttype-test.h and add the TEST_OP_FOO to the list of test operations.

Edit sttype-test.c and add TEST_OP_FOO to the num_operands() method.

Edit gencode.c, add TEST_OP_FOO in the gen_test() method by defining
ANY_FOO.

Edit dfvm.h and add ANY_FOO to the enum dfvm_opcode_t structure.

Edit dfvm.c and add ANY_FOO to dfvm_dump() (for the dftest display filter
test binary), to dfvm_apply() hence defining the methods fvalue_foo().

Edit semcheck.c and look at the check_relation_XXX() methods if they
still apply to the foo operator; if not, amend the code. Start from the
check_test() method to discover the logic.

Go to wireshark/epan/ftypes/

Edit ftypes.h and declare the fvalue_foo(), ftype_can_foo() and
fvalue_foo() methods. Add the cmp_foo() method to the struct _ftype_t.

This is the first time that a make in wireshark/epan/dfilter/ can
succeed. If it fails, then some code in the previously edited files must
be corrected.

Edit ftypes.c and define the fvalue_foo() method with its associated
logic. Define also the ftype_can_foo() and fvalue_foo() methods.

Edit all ftype-*.c files and add the required fvalue_foo() methods.

This is the point where you should be able to compile without errors in
wireshark/epan/ftypes/. If not, first fix the errors.

Go to wireshark/epan/ and run make. If this one succeeds, then we're
almost done as no errors should occur here.

Go to wireshark/ and run make. One thing to do is make dftest and see
if you can construct valid display filters with your new operator. Or
you may want to move directly to the generation of Wireshark.

Look also at wireshark/gtk/dfilter_expr_dlg.c and edit the display filter
expression generator.

How to add a new test to dfilter-test.py
========================================
Note: dfilter-test.py requires Python 2.7

"tools/dfilter-test.py" is the main test script. It includes
the test from files in tools/dftestlib. You can add a test
to a file in tools/dftestlib, or you can create a new file
in tools/dftestlib. If you do add a new file, you must
import it (and the class it defines) in "tools/dfilter-test.py"

Each new test class must define "trace_file", which names
a capture file in "tools/dftestfiles". All the tests
run in that class will use that one capture file.

There are 2 methods you can use for testing:

assertDfilter(dfilter_text, expected_count)

    This will run the display filter through tshark, on the
    file named by "trace_file", and assert that the
    number of resulting packets equals "expected_count". This
    also asserts that tshark does not fail; success with zero
    matches is not the same as failure to compile the display
    filter string.

assertDFilterFail(dfilter_text)

    This will run tshark with the display filter, and
    asser that tshark fails. This is useful when expecting
    display filter syntax errors to be caught.

Then, simply run "dfilter-test.py". You can run the tests
in a single Test class by naming that -class on the
dfilter-test.py command-line, or even run a single
test by naming it. E.g., the following are all valid ways
of running dfilter-test.py:

# Run all tests
$ ./tools/dfilter-test.py

# Run all tests in "testTVB"
$ ./tools/dfilter-test.py testTVB

# Run the the "test_contains_1" test from testTVB
$ ./tools/dfilter-test.py testTVB.test_contains_1

Note that dfilter-test.py should be run from the top of the
Wireshark distribution, so it knows where to find the default
tshark executable.