Protocol Dissection in XML Format
=================================
$Id$
Copyright (c) 2003 by Gilbert Ramirez <gram@alumni.rice.edu>

Wireshark can export its protocol dissection in an XML format. tshark
provides similar functionality through the "-Tpdml" option.

The XML that Wireshark produces follows the Packet Details Markup
Language (PDML) specified by the group at the Politecnico Di Torino
working on Analyzer. The specification can be found at:

    http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm

A related XML format, the Packet Summary Markup Language (PSML), is
also defined by the Analyzer group to provide packet summary information.
The PSML format is not documented in a publicly available HTML document,
but its format is simple. Wireshark can export this format too. Some day it
may be added to tshark so that "-Tpsml" would produce PSML.

One wonders if the "-T" option should read "-Txml" instead of "-Tpdml"
(and in the future, "-Tpsml"), but if tshark were required to produce
another XML-based format of its protocol dissection, then "-Txml" would
be ambiguous.

PDML
====
The PDML that Wireshark produces is known not to be loadable into Analyzer.
It causes Analyzer to crash. As such, the PDML that Wireshark produces
is labeled with a version number of "0", which means that the PDML does
not fully follow the PDML spec. Furthermore, a "creator" attribute in the
"<pdml>" tag gives the version number of wireshark/tshark that produced the PDML.
In that way, as the PDML produced by Wireshark matures, but still does not
meet the PDML spec, scripts can make intelligent decisions about how to
best parse the PDML, based on the "creator" attribute.

A PDML file is delimited by a "<pdml>" tag.
A PDML file contains multiple packets, denoted by the "<packet>" tag.
A packet will contain multiple protocols, denoted by the "<proto>" tag.
A protocol might contain one or more fields, denoted by the "<field>" tag.

A pseudo-protocol named "geninfo" is produced, as is required by the PDML
spec, and exported as the first protocol after the opening "<packet>" tag.
Its information comes from Wireshark's "frame" protocol, which serves
the similar purpose of storing packet meta-data. Both "geninfo" and
"frame" protocols are provided in the PDML output.

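The nesting described above can be illustrated with a minimal sketch that
uses Python's standard xml.etree.ElementTree module rather than anything
shipped with Wireshark. The sample document is hand-written for illustration
and the function name is hypothetical; real PDML carries many more attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not captured tshark output.
SAMPLE_PDML = """<pdml version="0" creator="wireshark/0.9.17">
  <packet>
    <proto name="geninfo" pos="0" size="60"/>
    <proto name="frame" pos="0" size="60"/>
    <proto name="eth" pos="0" size="14">
      <field name="eth.type" pos="12" size="2" value="0800" show="0x0800"/>
    </proto>
  </packet>
</pdml>"""


def protocol_names_per_packet(pdml_text):
    """Return one list of <proto> names for each <packet> in the document."""
    root = ET.fromstring(pdml_text)
    return [
        [proto.get("name") for proto in packet.findall("proto")]
        for packet in root.findall("packet")
    ]


print(protocol_names_per_packet(SAMPLE_PDML))
```

Note that "geninfo" comes first and "frame" second, matching the ordering
described above.
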
The "<pdml>" tag
================
Example:
    <pdml version="0" creator="wireshark/0.9.17">

The creator is "wireshark" (i.e., the "wireshark" engine; it will always say
"wireshark", not "tshark"), version 0.9.17.

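A script that wants to adapt its parsing to the producer could split the
"creator" attribute as sketched below. This is only an illustration built on
Python's standard xml.etree.ElementTree module; the function name is
hypothetical, and the "name/version" layout is assumed from the example above.

```python
import xml.etree.ElementTree as ET


def pdml_producer(pdml_text):
    """Return (pdml_version, creator_name, creator_version) from the <pdml> tag."""
    root = ET.fromstring(pdml_text)
    # Assumes the creator attribute has the form "name/version".
    name, _, version = root.get("creator", "").partition("/")
    return root.get("version"), name, version


print(pdml_producer('<pdml version="0" creator="wireshark/0.9.17"></pdml>'))
```
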
The "<proto>" tag
=================
"<proto>" tags can have the following attributes:

    name - the display filter name for the protocol
    showname - the label used to describe this protocol in the protocol
        tree. This is usually the descriptive name of the protocol,
        but it can be modified by dissectors to include more data
        (tcp can do this)
    pos - the starting offset within the packet data where this
        protocol starts
    size - the number of octets in the packet data that this protocol
        covers.

The "<field>" tag
=================
"<field>" tags can have the following attributes:

    name - the display filter name for the field
    showname - the label used to describe this field in the protocol
        tree. This is usually the descriptive name of the field,
        followed by some representation of the value.
    pos - the starting offset within the packet data where this
        field starts
    size - the number of octets in the packet data that this field
        covers.
    value - the actual packet data, in hex, that this field covers
    show - the representation of the packet data ('value') as it would
        appear in a display filter.

Some dissectors place text into the protocol tree without using
a field with a field-name. Those appear in PDML as "<field>" tags with no
'name' attribute, but with a 'show' attribute giving that text.

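As a rough sketch of how a script might consume these attributes, the
following uses Python's standard xml.etree.ElementTree module to convert
'pos' to an integer and 'value' to raw bytes, and to cope with text-only
fields that lack a 'name'. The sample fragment and the function name are
made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; not captured tshark output.
SAMPLE_PROTO = """<proto name="ip" showname="Internet Protocol" pos="14" size="20">
  <field name="ip.ttl" showname="Time to live: 64" pos="22" size="1" value="40" show="64"/>
  <field show="Some text placed in the tree by a dissector"/>
</proto>"""


def describe_fields(proto_xml):
    """Return a list of dicts summarizing every <field> below a <proto>."""
    results = []
    for field in ET.fromstring(proto_xml).iter("field"):
        pos = field.get("pos")
        raw = field.get("value")
        results.append({
            "name": field.get("name"),   # None for text-only entries
            "show": field.get("show"),
            "pos": int(pos) if pos is not None else None,
            "bytes": bytes.fromhex(raw) if raw else None,  # 'value' is hex
        })
    return results


for entry in describe_fields(SAMPLE_PROTO):
    print(entry)
```
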
Many dissectors label the undissected payload of a protocol as belonging
to a "data" protocol, and the "data" protocol usually resides inside
the last protocol dissected. In the PDML, the "data" protocol becomes
a "data" field, placed exactly where the "data" protocol is in Wireshark's
protocol tree. So, if Wireshark would normally show:

    +-- Frame
    |
    +-- Ethernet
    |
    +-- IP
    |
    +-- TCP
    |
    +-- HTTP
        |
        +-- Data

In PDML, the "Data" protocol would become another field under HTTP:

    <packet>
        <proto name="frame">
        ...
        </proto>

        <proto name="eth">
        ...
        </proto>

        <proto name="ip">
        ...
        </proto>

        <proto name="tcp">
        ...
        </proto>

        <proto name="http">
        ...
            <field name="data" value="........."/>
        </proto>
    </packet>

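Because the undissected payload ends up as an ordinary field named "data",
a script can collect it per protocol. The sketch below assumes Python's
standard xml.etree.ElementTree module; the sample packet and the function
name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written packet for illustration only; not captured tshark output.
SAMPLE_PACKET = """<packet>
  <proto name="tcp"/>
  <proto name="http">
    <field name="http.request.method" show="GET"/>
    <field name="data" value="48656c6c6f"/>
  </proto>
</packet>"""


def undissected_payloads(packet_xml):
    """Map each protocol name to the bytes of its "data" field, if it has one."""
    payloads = {}
    for proto in ET.fromstring(packet_xml).findall("proto"):
        for field in proto.iter("field"):
            if field.get("name") == "data" and field.get("value"):
                payloads[proto.get("name")] = bytes.fromhex(field.get("value"))
    return payloads


print(undissected_payloads(SAMPLE_PACKET))
```
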
tools/WiresharkXML.py
=====================
This is a Python module which provides some infrastructure for
Python developers who wish to parse PDML. It is designed to read
a PDML file and call a user's callback function every time a packet
is constructed from the protocols and fields for a single packet.

The Python user should import the module, define a callback function
which accepts one argument, and call the parse_fh function:

------------------------------------------------------------
import WiresharkXML

def my_callback(packet):
    # do something with the packet
    pass

# xml_filename is the path to a PDML file
fh = open(xml_filename)
WiresharkXML.parse_fh(fh, my_callback)
fh.close()

# Now that the script has the packet data, do something.
------------------------------------------------------------

The object that is passed to the callback function is a
WiresharkXML.Packet object, which corresponds to a single packet.
WiresharkXML provides three classes, each of which corresponds to a PDML tag:

    Packet - "<packet>" tag
    Protocol - "<proto>" tag
    Field - "<field>" tag

Each of these classes has accessors which will return the defined attributes:

    get_name()
    get_showname()
    get_pos()
    get_size()
    get_value()
    get_show()

Protocols and fields can contain other fields. Thus, the Protocol and
Field classes have a "children" member, which is a simple list of the
Field objects, if any, that are contained. The "children" list can be
directly accessed by code using the object. The "children" list will be
empty if this Protocol or Field contains no Fields.

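The "children" list lends itself to a simple recursive walk. The sketch below
relies only on the get_name() accessor and the "children" member described
above. StubItem is a hypothetical stand-in so that the example runs without
WiresharkXML installed; it is not part of the module.

```python
class StubItem:
    """Stand-in for a WiresharkXML Protocol or Field (illustration only)."""

    def __init__(self, name, children=()):
        self._name = name
        self.children = list(children)

    def get_name(self):
        return self._name


def tree_lines(item, depth=0):
    """Return indented text lines for an item and all of its descendants."""
    lines = ["  " * depth + str(item.get_name())]
    for child in item.children:
        lines.extend(tree_lines(child, depth + 1))
    return lines


ip = StubItem("ip", [
    StubItem("ip.flags", [StubItem("ip.flags.df")]),
    StubItem("ip.ttl"),
])
print("\n".join(tree_lines(ip)))
```

In a real script, tree_lines() would presumably be called on a Protocol
object obtained inside the callback instead of on a StubItem.
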
Furthermore, the Packet class is a sub-class of the PacketList class.
The PacketList class provides methods to look for protocols and fields.
The term "item" is used when the item being looked for can be
a protocol or a field:

    item_exists(name) - checks if an item exists in the PacketList
    get_items(name) - returns a PacketList of all matching items

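A callback might combine the two methods as sketched below. Only
item_exists() and get_items() are taken from the description above; the
helper name make_http_collector is hypothetical, and "http.request.uri" is
used as an example item name.

```python
def make_http_collector(results):
    """Build a callback that stores the HTTP request URI items of each packet."""

    def callback(packet):
        # Skip packets without an HTTP layer before doing further work.
        if not packet.item_exists("http"):
            return
        # get_items() returns a PacketList of every matching item.
        results.append(packet.get_items("http.request.uri"))

    return callback
```

The resulting function can be handed to the parser in place of my_callback,
for example WiresharkXML.parse_fh(fh, make_http_collector(found)) with
"found" being an initially empty list.
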
General Notes
=============
Generally, parsing XML is slow. If you're writing a script to parse
the PDML output of tshark, pass a read filter with "-R" to tshark to
reduce the number of packets coming out of tshark as much as possible.
The fewer packets your script has to process, the faster it will run.

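A script could assemble the tshark invocation as sketched below. The function
names are hypothetical, and the sketch assumes tshark is on the PATH and
accepts "-R" as described above. Newer tshark releases expect "-Y" for a
single-pass display filter, so the flag may need adjusting for the installed
version.

```python
import subprocess


def tshark_pdml_command(capture_file, read_filter=None):
    """Build an argument list asking tshark for PDML output of capture_file."""
    command = ["tshark", "-r", capture_file, "-Tpdml"]
    if read_filter:
        # Filtering inside tshark keeps the XML small for the parsing script.
        command += ["-R", read_filter]
    return command


def run_tshark(capture_file, read_filter=None):
    """Run tshark and return its PDML output as bytes (needs tshark installed)."""
    return subprocess.check_output(tshark_pdml_command(capture_file, read_filter))


print(tshark_pdml_command("capture.pcap", "http"))
```
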
'tools/msnchat' is a sample Python program that uses WiresharkXML to parse
PDML. Given one or more capture files, it runs tshark on each of them,
providing a read filter to reduce tshark's output. It finds MSN Chat
conversations in the capture file and produces nice HTML showing the
conversations. It has only been tested with capture files containing
non-simultaneous chat sessions, but was written to more-or-less handle any
number of simultaneous chat sessions.