author | Jorge Mora <jmora1300@gmail.com> | 2017-11-14 16:55:14 -0500 |
---|---|---|
committer | Anders Broman <a.broman58@gmail.com> | 2018-03-24 07:09:59 +0000 |
commit | 8f0f691312d93b12511c5cebf3e414b15e7661a4 (patch) | |
tree | 6460808a6a69fa90a934512c01682c3dddee0454 /epan/dissectors/packet-nfs.c | |
parent | a6fcceb98e3b483b859753494524c5ef64a4e844 (diff) |
RPC-over-RDMA: add reassembly for reply, read and write chunks
The RDMA reply chunk is used for a large RPC reply which does not fit
into a single SEND operation and does not have a single large opaque,
e.g., NFS READDIR. The RPC call packet is used only to set up the RDMA
reply chunk. The whole RPC reply is transferred via RDMA writes.
Fragments are added on any RDMA write packet (RDMA_WRITE_ONLY,
RDMA_WRITE_FIRST, etc.), and reassembly is done on the reply
message. The RPC reply packet itself has no data (RDMA_NOMSG),
but the fragments are reassembled and the whole RPC reply is
dissected.
The RDMA read chunk list is used for a large RPC call which has
at least one large opaque, e.g., NFS WRITE. The RPC call packet
is used only to set up the RDMA read chunk list. It also has the
reduced message data, which includes the first fragment (XDR data
up to and including the opaque length), but it could also have
fragments between each read chunk and a last fragment after the
last read chunk data. The reduced message is then broken down
into fragments and inserted into the reassembly table. Since the
RDMA read chunk list is set up in the RPC call, the upper layer
is not dissected in this case; the rest of the packet is simply
labeled "Data", because the reassembly will be done on the last
read response.
The protocol gives the XDR position where each chunk must be
inserted into the XDR stream, so as long as the maximum I/O size
is known it is possible to know exactly where to insert these
fragments. This maximum I/O size is set on the first
READ_RESPONSE_FIRST or READ_RESPONSE_MIDDLE. If neither of these
packets has been seen, a value of 100 is used (the real value
should be at least 1024); in that case the message numbers are
not consecutive between chunks, but since the total size of all
chunks is verified to make sure there is a complete message to
reassemble, all fragments still end up in the correct order.
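The mapping from XDR position to reassembly message number described above can be sketched as follows. This is an illustrative assumption, not the dissector's actual code: the function name, the `+1` convention (message number 0 being the reduced fragment from the RDMA_MSG packet), and the fallback value of 100 are taken from the description above.

```c
#include <stdint.h>

/* Sketch: derive a reassembly "message number" for a read chunk
 * fragment from its XDR position, assuming fragments are numbered in
 * units of the maximum I/O size. */
static uint32_t
msg_num_for_xdr_position(uint32_t xdr_position, uint32_t max_iosize)
{
	/* Fall back to a small default when no READ_RESPONSE_FIRST or
	 * READ_RESPONSE_MIDDLE has been seen yet. */
	if (max_iosize == 0)
		max_iosize = 100;
	/* +1 because message number 0 is the first reduced fragment
	 * coming from the RDMA_MSG packet itself. */
	return xdr_position / max_iosize + 1;
}
```

With the fallback value the numbers are smaller than with the real I/O size, which is why they are not consecutive between chunks, but their relative order is preserved.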
Fragments are added on any RDMA read packet (RDMA_READ_RESPONSE_ONLY,
RDMA_READ_RESPONSE_FIRST, etc.), and reassembly is done on the
last read response. Since there could be multiple chunks, and each
chunk could have multiple segments, the total size must be checked
to complete the reassembly, because in this case there will be
multiple READ_RESPONSE_LAST packets.
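The completion check above can be sketched as a total-size comparison. The structure and function names here are illustrative assumptions; the point is that seeing a READ_RESPONSE_LAST alone is not enough, since every chunk ends with one:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bookkeeping per registered segment of a chunk */
struct rdma_segment {
	uint32_t expected;  /* bytes the segment should carry */
	uint32_t received;  /* bytes seen so far */
};

/* Reassembly may only complete once the bytes received across all
 * segments of all chunks reach the expected total. */
static bool
chunks_complete(const struct rdma_segment *segs, size_t nsegs)
{
	uint64_t expected = 0, received = 0;
	for (size_t i = 0; i < nsegs; i++) {
		expected += segs[i].expected;
		received += segs[i].received;
	}
	return expected > 0 && received >= expected;
}
```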
The RDMA write chunk list is used for a large RPC reply which has
at least one large opaque, e.g., NFS READ. The RPC call packet is
used only to set up the RDMA write chunk list. The opaque data is
then transferred via RDMA writes and then the RPC reply packet is
sent from the server.
The RPC reply packet has the reduced message data which includes
the first fragment (XDR data up to and including the opaque length),
but it could also have fragments between each write chunk and the
last fragment after the last write chunk data. The reduced message
is then broken down into fragments and inserted into the reassembly
table. Since the RPC reply is sent after all the RDMA writes,
the fragments from these writes must be inserted in the correct
order: the first RDMA write fragment is inserted with message
number 1, since the first fragment (message number 0) will come
from the very last packet (the RPC reply with RDMA_MSG). Also,
the last packet could have fragments which must be inserted in
between chunk data; therefore message numbers from one chunk to
another are not consecutive.
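One way to picture the non-consecutive numbering is a per-chunk base with a deliberate gap after each chunk, reserved for an in-between fragment from the reply packet. This is a simplified illustration under the assumption of a fixed fragment count per chunk, not the dissector's actual scheme:

```c
#include <stdint.h>

/* Sketch: message number for an RDMA write fragment. Number 0 is
 * reserved for the first reduced fragment of the RPC reply (which
 * arrives last, with RDMA_MSG); each chunk leaves one gap after it for
 * a reply fragment that belongs between chunks. */
static uint32_t
write_frag_msg_num(uint32_t chunk_index, uint32_t frag_in_chunk,
                   uint32_t frags_per_chunk)
{
	return 1 + chunk_index * (frags_per_chunk + 1) + frag_in_chunk;
}
```

With four fragments per chunk, chunk 0 occupies numbers 1..4, number 5 is free for reply data between the chunks, and chunk 1 starts at 6, so numbers from one chunk to the next are not consecutive.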
In contrast to the RDMA read chunk list, the protocol does not
provide an XDR position in the RDMA write chunks, since the RPC
client knows exactly where to insert the chunk's data from the
virtual address of the DDP (direct data placement) item. There is
no way to map a virtual address to an XDR position, so in order
to reassemble the XDR stream a two-pass approach is used. In the
first pass (visited = 0), all RDMA writes are inserted as
fragments, leaving a gap between each chunk. The dissector for
the upper layer is then called with a flag letting it know that
it is dealing with a reduced message, so all DDP-enabled
operations handle the opaque data as having only the size of the
opaque but not the data, reporting back the offset from the end
of the message. Once the upper layer dissector returns, this
layer has a list of DDP-eligible items' offsets, which are then
translated into XDR offsets; the RPC reply packet is broken into
fragments and inserted in the right places, as in the case of the
RDMA read chunk list. On the second pass (visited = 1), all
fragments have already been inserted into the reassembly table,
so it just needs to reassemble the whole message and then call
the upper layer dissector.
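The offset translation in the first pass can be sketched as follows; the function name is an illustrative assumption. The upper layer reports each reduced opaque as an offset measured from the end of the reduced message (it cannot know absolute XDR positions for write chunks), and once the reduced message length is known these map back to absolute XDR offsets, i.e. the points where the reply must be split into fragments:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: convert end-relative offsets reported by the upper layer
 * into absolute XDR offsets within the reduced RPC reply, which are
 * the split points where the write chunk data must be spliced in. */
static void
xdr_offsets_from_end(uint32_t msg_len, const uint32_t *from_end,
                     uint32_t *xdr_off, size_t n)
{
	for (size_t i = 0; i < n; i++)
		xdr_off[i] = msg_len - from_end[i];
}
```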
RFC 8267 specifies the upper layer bindings to RPC-over-RDMA
version 1 for NFS. Since RPC-over-RDMA version 1 specifies the
XDR position for the read chunks, only the write chunk
DDP-eligible items are handled in the upper layer, in this case
the NFS layer. These are the only procedures or operations
eligible for write chunks:
* The opaque data result in the NFS READ procedure or operation
* The pathname or linkdata result in the NFS READLINK procedure
or operation
Two functions are defined to signal the reduction and report the
DDP-eligible item's offset back from the upper layers.
Function rpcrdma_is_reduced() is used to signal the upper layer
that it is dealing with a reduced data message and thus should
skip the DDP-eligible item's opaque processing and just report
back the offset where the opaque data should be. This reporting
is done using the second function, rpcrdma_insert_offset().
Reassembly is done for InfiniBand only. Fragments are reassembled
using the packet sequence number (PSN) of each RDMA I/O fragment,
to make sure the message is reassembled correctly when fragments
are sent out of order. Also, a unique message id is used for each
message, so fragments are reassembled correctly when fragments of
different messages are sent in parallel.
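Restoring fragment order by PSN can be sketched with a plain sort; the structure is an illustrative assumption (the real dissector keys fragments into a reassembly table rather than sorting an array):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed minimal view of an RDMA I/O fragment */
struct rdma_frag {
	uint32_t      psn;   /* InfiniBand packet sequence number */
	const void   *data;
	size_t        len;
};

static int
cmp_psn(const void *a, const void *b)
{
	const struct rdma_frag *fa = a, *fb = b;
	return (fa->psn > fb->psn) - (fa->psn < fb->psn);
}

/* Fragments may arrive out of order; ordering by PSN restores the
 * on-the-wire byte order before reassembly. */
static void
sort_frags_by_psn(struct rdma_frag *frags, size_t nfrags)
{
	qsort(frags, nfrags, sizeof(*frags), cmp_psn);
}
```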
The reassembled message could be composed of multiple chunks,
each chunk in turn could be composed of multiple segments, each
segment could be composed of multiple requests, and each request
is composed of one or more fragments. Thus, in order to have all
fragments for each segment belong to the same message, a list of
segments is created and all segments belonging to the same
message are initialized with the same message id. These segments
are initialized and added to the list on the call side on
RDMA_MSG by calling process_rdma_lists().
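The message-id tagging can be sketched as below. The structure and function names are illustrative assumptions (in the dissector the counter would live in per-conversation state); the point is that every segment of one message carries the same freshly allocated id, so fragments of different messages arriving in parallel reassemble independently:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed minimal per-segment state */
struct rdma_segment_state {
	uint32_t msg_id;
};

static uint32_t next_msg_id;  /* per-conversation counter in practice */

/* On the call side (RDMA_MSG), tag all segments of the message's
 * chunk lists with the same new message id. */
static uint32_t
init_segments(struct rdma_segment_state *segs, size_t nsegs)
{
	uint32_t id = ++next_msg_id;
	for (size_t i = 0; i < nsegs; i++)
		segs[i].msg_id = id;
	return id;
}
```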
Bug: 13260
Change-Id: Icf57d7c46c3ba1de5d019265eb151a81d6019dfd
Reviewed-on: https://code.wireshark.org/review/24613
Petri-Dish: Anders Broman <a.broman58@gmail.com>
Tested-by: Petri Dish Buildbot
Reviewed-by: Anders Broman <a.broman58@gmail.com>
Diffstat (limited to 'epan/dissectors/packet-nfs.c')
-rw-r--r-- | epan/dissectors/packet-nfs.c | 67 |
1 files changed, 61 insertions, 6 deletions
diff --git a/epan/dissectors/packet-nfs.c b/epan/dissectors/packet-nfs.c
index ce738a5b51..f0fc7adcf7 100644
--- a/epan/dissectors/packet-nfs.c
+++ b/epan/dissectors/packet-nfs.c
@@ -26,6 +26,7 @@
 #include <epan/crc32-tvb.h>
 #include <wsutil/str_util.h>
 #include "packet-nfs.h"
+#include "packet-rpcrdma.h"
 
 void proto_register_nfs(void);
 void proto_reg_handoff_nfs(void);
@@ -953,6 +954,17 @@
 static wmem_tree_t *nfs_file_handles = NULL;
 static gboolean nfs_display_v4_tag = TRUE;
 static gboolean display_major_nfs4_ops = TRUE;
 
+/* Types of RDMA reduced opaque data */
+typedef enum {
+	R_UTF8STRING,
+	R_NFS2_PATH,
+	R_NFS3_PATH,
+	R_NFSDATA,
+} rdma_reduce_type_t;
+
+static int dissect_nfsdata_reduced(rdma_reduce_type_t rtype, tvbuff_t *tvb,
+		int offset, proto_tree *tree, int hf, const char **name);
+
 static int dissect_nfs4_stateid(tvbuff_t *tvb, int offset, proto_tree *tree, guint16 *hash);
 
 static void nfs_prompt(packet_info *pinfo _U_, gchar* result)
@@ -3039,7 +3051,7 @@ dissect_nfs2_readlink_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 	offset = dissect_nfs2_status(tvb, offset, tree, &status);
 	switch (status) {
 		case 0:
-			offset = dissect_path(tvb, offset, tree, hf_nfs2_readlink_data, &name);
+			offset = dissect_nfsdata_reduced(R_NFS2_PATH, tvb, offset, tree, hf_nfs2_readlink_data, &name);
 			col_append_fstr(pinfo->cinfo, COL_INFO, " Path: %s", name);
 			proto_item_append_text(tree, ", READLINK Reply Path: %s", name);
 			break;
@@ -3097,7 +3109,7 @@ dissect_nfs2_read_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 		case 0:
 			offset = dissect_nfs2_fattr(tvb, offset, tree, "attributes");
 			proto_item_append_text(tree, ", READ Reply");
-			offset = dissect_nfsdata(tvb, offset, tree, hf_nfs_data);
+			offset = dissect_nfsdata_reduced(R_NFSDATA, tvb, offset, tree, hf_nfs_data, NULL);
 			break;
 		default:
 			err = val_to_str_ext(status, &names_nfs2_stat_ext, "Unknown error: %u");
@@ -4808,7 +4820,7 @@ dissect_nfs3_readlink_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 		case 0:
 			offset = dissect_nfs3_post_op_attr(tvb, offset, pinfo, tree,
				"symlink_attributes");
-			offset = dissect_nfs3_path(tvb, offset, tree,
+			offset = dissect_nfsdata_reduced(R_NFS3_PATH, tvb, offset, tree,
 				hf_nfs2_readlink_data, &name);
 
 			col_append_fstr(pinfo->cinfo, COL_INFO, " Path: %s", name);
@@ -4878,7 +4890,7 @@ dissect_nfs3_read_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 				offset);
 			col_append_fstr(pinfo->cinfo, COL_INFO, " Len: %d", len);
 			proto_item_append_text(tree, ", READ Reply Len: %d", len);
-			offset = dissect_nfsdata(tvb, offset, tree, hf_nfs_data);
+			offset = dissect_nfsdata_reduced(R_NFSDATA, tvb, offset, tree, hf_nfs_data, NULL);
 			break;
 		default:
 			offset = dissect_nfs3_post_op_attr(tvb, offset, pinfo, tree,
@@ -6157,6 +6169,49 @@ dissect_nfs_utf8string(tvbuff_t *tvb, int offset,
 }
 
+/*
+ * When using RPC-over-RDMA, certain opaque data are eligible for DDP
+ * (direct data placement), so these must be reduced by sending just
+ * the opaque length with the rest of the NFS packet and the opaque
+ * data is sent separately using RDMA (RFC 8267).
+ */
+static int
+dissect_nfsdata_reduced(rdma_reduce_type_t rtype, tvbuff_t *tvb, int offset,
+		proto_tree *tree, int hf, const char **name)
+{
+	if (rpcrdma_is_reduced()) {
+		/*
+		 * The opaque data is reduced so just increment the offset
+		 * since there is no actual data yet.
+		 */
+		offset += 4;
+		/* Add offset (from the end) where the opaque data should be */
+		rpcrdma_insert_offset(tvb_reported_length_remaining(tvb, offset));
+		if (name) {
+			/* Return non-NULL string */
+			*name = "";
+		}
+	} else {
+		/* No data reduction, dissect the opaque data */
+		switch (rtype) {
+			case R_UTF8STRING:
+				offset = dissect_nfs_utf8string(tvb, offset, tree, hf, name);
+				break;
+			case R_NFS2_PATH:
+				offset = dissect_path(tvb, offset, tree, hf, name);
+				break;
+			case R_NFS3_PATH:
+				offset = dissect_nfs3_path(tvb, offset, tree, hf, name);
+				break;
+			case R_NFSDATA:
+				offset = dissect_nfsdata(tvb, offset, tree, hf);
+				break;
+		}
+	}
+	return offset;
+}
+
+
 static int
 dissect_nfs4_deviceid(tvbuff_t *tvb, int offset, proto_tree *tree)
 {
@@ -10530,7 +10585,7 @@ dissect_nfs4_response_op(tvbuff_t *tvb, int offset, packet_info *pinfo, proto_tr
 		case NFS4_OP_READ:
 			offset = dissect_rpc_uint32(tvb, newftree, hf_nfs4_eof, offset);
 			dissect_rpc_uint32(tvb, newftree, hf_nfs4_read_data_length, offset); /* don't change offset */
-			offset = dissect_nfsdata(tvb, offset, newftree, hf_nfs_data);
+			offset = dissect_nfsdata_reduced(R_NFSDATA, tvb, offset, newftree, hf_nfs_data, NULL);
 			break;
 
 		case NFS4_OP_READDIR:
@@ -10539,7 +10594,7 @@ dissect_nfs4_response_op(tvbuff_t *tvb, int offset, packet_info *pinfo, proto_tr
 			break;
 
 		case NFS4_OP_READLINK:
-			offset = dissect_nfs_utf8string(tvb, offset, newftree, hf_nfs4_linktext, NULL);
+			offset = dissect_nfsdata_reduced(R_UTF8STRING, tvb, offset, newftree, hf_nfs4_linktext, NULL);
 			break;
 
 		case NFS4_OP_RECLAIM_COMPLETE: