wireshark/epan/dissectors/packet-rpcrdma.h

RPC-over-RDMA: add reassembly for reply, read and write chunks

The RDMA reply chunk is used for a large RPC reply which does not fit into a single SEND operation and does not have a single large opaque, e.g., NFS READDIR. The RPC call packet is used only to set up the RDMA reply chunk. The whole RPC reply is transferred via RDMA writes. Fragments are added on any RDMA write packet (RDMA_WRITE_ONLY, RDMA_WRITE_FIRST, etc.) and the reassembly is done on the reply message. The RPC reply packet has no data (RDMA_NOMSG), but the fragments are reassembled and the whole RPC reply is dissected.

The RDMA read chunk list is used for a large RPC call which has at least one large opaque, e.g., NFS WRITE. The RPC call packet is used only to set up the RDMA read chunk list. It also carries the reduced message data, which includes the first fragment (XDR data up to and including the opaque length), but it could also have fragments between each read chunk and a last fragment after the last read chunk's data. The reduced message is broken down into fragments and inserted into the reassembly table. Since the RDMA read chunk list is set up in the RPC call, the upper layer is not dissected at that point; the rest of the packet is simply labeled as "Data", because the reassembly is done on the last read response.

The protocol gives the XDR position where each chunk must be inserted into the XDR stream, so as long as the maximum I/O size is known it is possible to know exactly where to insert these fragments. The maximum I/O size is set on the first READ_RESPONSE_FIRST or READ_RESPONSE_MIDDLE. If neither of these packets has been seen, a fallback value of 100 is used (the real value should be at least 1024); in that case the message numbers are not consecutive between chunks, but since the total size of all chunks is verified to make sure there is a complete message to reassemble, all fragments still end up in the correct order.
Fragments are added on any RDMA read packet (RDMA_READ_RESPONSE_ONLY, RDMA_READ_RESPONSE_FIRST, etc.) and the reassembly is done on the last read response. Since there could be multiple chunks, and each chunk could have multiple segments, the total size must be checked to complete the reassembly, because in this case there will be multiple READ_RESPONSE_LAST packets.

The RDMA write chunk list is used for a large RPC reply which has at least one large opaque, e.g., NFS READ. The RPC call packet is used only to set up the RDMA write chunk list. The opaque data is transferred via RDMA writes, after which the RPC reply packet is sent from the server. The RPC reply packet carries the reduced message data, which includes the first fragment (XDR data up to and including the opaque length), but it could also have fragments between each write chunk and a last fragment after the last write chunk's data. The reduced message is broken down into fragments and inserted into the reassembly table. Since the RPC reply is sent after all the RDMA writes, the fragments from these writes must be inserted in the correct order: the first RDMA write fragment is inserted with message number 1, since the first fragment (message number 0) comes from the very last packet (the RPC reply with RDMA_MSG). Also, the last packet could have fragments which must be inserted in between chunk data, so message numbers from one chunk to another are not consecutive.

In contrast with the RDMA read chunk list, the protocol does not provide an XDR position in the RDMA write chunks, since the RPC client knows exactly where to place the chunk's data from the virtual address of the DDP (direct data placement) item. There is no way to map a virtual address to an XDR position, so in order to reassemble the XDR stream a two-pass approach is used. In the first pass (visited = 0), all RDMA writes are inserted as fragments, leaving a gap between each chunk.
Then the dissector for the upper layer is called with a flag letting it know that it is dealing with a reduced message, so all DDP-enabled operations handle the opaque data as having only the size of the opaque, not the data itself, and report back the offset from the end of the message. Once the upper layer dissector returns, this layer has a list of DDP-eligible item offsets, which are translated into XDR offsets; the RPC reply packet is then broken into fragments and inserted in the right places, as in the case of the RDMA read chunk list. On the second pass (visited = 1), all fragments have already been inserted into the reassembly table, so this layer just needs to reassemble the whole message and then call the upper layer dissector.

RFC 8267 specifies the upper layer bindings to RPC-over-RDMA version 1 for NFS. Since RPC-over-RDMA version 1 specifies the XDR position for the read chunks, only the write chunk DDP-eligible items are handled in the upper layer, in this case the NFS layer. These are the only procedures or operations eligible for write chunks:
* The opaque data result in the NFS READ procedure or operation
* The pathname or linkdata result in the NFS READLINK procedure or operation

Two functions are defined to signal and report back the DDP-eligible item's offset for use by the upper layers. Function rpcrdma_is_reduced() signals the upper layer that it is dealing with a reduced data message and thus should skip the DDP-eligible item's opaque processing and just report back the offset where the opaque data should be. This reporting is done using the second function, rpcrdma_insert_offset().

Reassembly is done for InfiniBand only. Fragments are reassembled using the packet sequence number (PSN) of each RDMA I/O fragment, to make sure the message is reassembled correctly when fragments are sent out of order.
Also, a unique message id is used for each message, so fragments are reassembled correctly when fragments of different messages are sent in parallel. The reassembled message could be composed of multiple chunks, each chunk could in turn be composed of multiple segments, each segment could be composed of multiple requests, and each request is composed of one or more fragments. Thus, in order to have all fragments for each segment belong to the same message, a list of segments is created and all segments belonging to the same message are initialized with the same message id. These segments are initialized and added to the list on the call side on RDMA_MSG by calling process_rdma_lists.

Bug: 13260
Change-Id: Icf57d7c46c3ba1de5d019265eb151a81d6019dfd
Reviewed-on: https://code.wireshark.org/review/24613
Petri-Dish: Anders Broman <a.broman58@gmail.com>
Tested-by: Petri Dish Buildbot
Reviewed-by: Anders Broman <a.broman58@gmail.com>
2017-11-14 21:55:14 +00:00
/* packet-rpcrdma.h
*
* Wireshark - Network traffic analyzer
* By Gerald Combs <gerald@wireshark.org>
* Copyright 1998 Gerald Combs
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/
#ifndef __PACKET_RPCRDMA_H_
#define __PACKET_RPCRDMA_H_

extern gboolean rpcrdma_is_reduced(void);
extern void rpcrdma_insert_offset(gint offset);

#endif
/*
* Editor modelines - https://www.wireshark.org/tools/modelines.html
*
* Local variables:
* c-basic-offset: 4
* tab-width: 8
* indent-tabs-mode: nil
* End:
*
* vi: set shiftwidth=4 tabstop=8 expandtab:
* :indentSize=4:tabSize=8:noTabs=true:
*/