author | Jorge Mora <jmora1300@gmail.com> | 2017-11-14 16:55:14 -0500 |
---|---|---|
committer | Anders Broman <a.broman58@gmail.com> | 2018-03-24 07:09:59 +0000 |
commit | 8f0f691312d93b12511c5cebf3e414b15e7661a4 (patch) | |
tree | 6460808a6a69fa90a934512c01682c3dddee0454 /epan/dissectors/packet-nfs.c | |
parent | a6fcceb98e3b483b859753494524c5ef64a4e844 (diff) |
RPC-over-RDMA: add reassembly for reply, read and write chunks
The RDMA reply chunk is used for a large RPC reply which does not fit
into a single SEND operation and does not have a single large opaque,
e.g., NFS READDIR. The RPC call packet is used only to set up the RDMA
reply chunk. The whole RPC reply is transferred via RDMA writes.
Fragments are added on any RDMA write packet (RDMA_WRITE_ONLY,
RDMA_WRITE_FIRST, etc.), and reassembly is done on the reply
message. The RPC reply packet itself has no data (RDMA_NOMSG),
but the fragments are reassembled and the whole RPC reply is
dissected.
The RDMA read chunk list is used for a large RPC call which has
at least one large opaque, e.g., NFS WRITE. The RPC call packet
is used only to set up the RDMA read chunk list. It also has the
reduced message data, which includes the first fragment (XDR data
up to and including the opaque length), but it could also have
fragments between each read chunk and a last fragment after the
last read chunk data. The reduced message is then broken down
into fragments and inserted into the reassembly table. Since the
RDMA read chunk list is set up in the RPC call, the upper layer
is not dissected in this case; the rest of the packet is simply
labeled "Data", because the reassembly will be done on the last
read response.
The protocol gives the XDR position where each chunk must be
inserted into the XDR stream, so as long as the maximum I/O size
is known it is possible to know exactly where to insert these
fragments. This maximum I/O size is set on the first
READ_RESPONSE_FIRST or READ_RESPONSE_MIDDLE. If neither of these
packets has been seen, a value of 100 is used (the real value
should be at least 1024); in that case the message numbers are
not consecutive between chunks, but since the total size of all
chunks is verified to make sure there is a complete message to
reassemble, all fragments still end up in the correct order.
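The mapping from XDR position to reassembly message number described above can be sketched as follows. This is an illustrative assumption, not the dissector's actual code: the function name, the `+1` convention (message number 0 being the reduced fragment from the RDMA_MSG packet), and the fallback value of 100 are taken from the description above.

```c
#include <stdint.h>

/* Sketch: derive a reassembly "message number" for a read chunk
 * fragment from its XDR position, assuming fragments are numbered in
 * units of the maximum I/O size. */
static uint32_t
msg_num_for_xdr_position(uint32_t xdr_position, uint32_t max_iosize)
{
	/* Fall back to a small default when no READ_RESPONSE_FIRST or
	 * READ_RESPONSE_MIDDLE has been seen yet. */
	if (max_iosize == 0)
		max_iosize = 100;
	/* +1 because message number 0 is the first reduced fragment
	 * coming from the RDMA_MSG packet itself. */
	return xdr_position / max_iosize + 1;
}
```

With the fallback value the numbers are smaller than with the real I/O size, which is why they are not consecutive between chunks, but their relative order is preserved.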
Fragments are added on any RDMA read packet (RDMA_READ_RESPONSE_ONLY,
RDMA_READ_RESPONSE_FIRST, etc.), and reassembly is done on the
last read response. Since there could be multiple chunks, and each
chunk could have multiple segments, the total size must be checked
to complete the reassembly, because in this case there will be
multiple READ_RESPONSE_LAST packets.
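The completion check above can be sketched as a total-size comparison. The structure and function names here are illustrative assumptions; the point is that seeing a READ_RESPONSE_LAST alone is not enough, since every chunk ends with one:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bookkeeping per registered segment of a chunk */
struct rdma_segment {
	uint32_t expected;  /* bytes the segment should carry */
	uint32_t received;  /* bytes seen so far */
};

/* Reassembly may only complete once the bytes received across all
 * segments of all chunks reach the expected total. */
static bool
chunks_complete(const struct rdma_segment *segs, size_t nsegs)
{
	uint64_t expected = 0, received = 0;
	for (size_t i = 0; i < nsegs; i++) {
		expected += segs[i].expected;
		received += segs[i].received;
	}
	return expected > 0 && received >= expected;
}
```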
The RDMA write chunk list is used for a large RPC reply which has
at least one large opaque, e.g., NFS READ. The RPC call packet is
used only to set up the RDMA write chunk list. The opaque data is
then transferred via RDMA writes and then the RPC reply packet is
sent from the server.
The RPC reply packet has the reduced message data which includes
the first fragment (XDR data up to and including the opaque length),
but it could also have fragments between each write chunk and the
last fragment after the last write chunk data. The reduced message
is then broken down into fragments and inserted into the reassembly
table. Since the RPC reply is sent after all the RDMA writes,
the fragments from these writes must be inserted in the correct
order: the first RDMA write fragment is inserted with message
number 1, since the first fragment (message number 0) will come
from the very last packet (the RPC reply with RDMA_MSG). Also,
the last packet could have fragments which must be inserted in
between chunk data; therefore message numbers from one chunk to
another are not consecutive.
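One way to picture the non-consecutive numbering is a per-chunk base with a deliberate gap after each chunk, reserved for an in-between fragment from the reply packet. This is a simplified illustration under the assumption of a fixed fragment count per chunk, not the dissector's actual scheme:

```c
#include <stdint.h>

/* Sketch: message number for an RDMA write fragment. Number 0 is
 * reserved for the first reduced fragment of the RPC reply (which
 * arrives last, with RDMA_MSG); each chunk leaves one gap after it for
 * a reply fragment that belongs between chunks. */
static uint32_t
write_frag_msg_num(uint32_t chunk_index, uint32_t frag_in_chunk,
                   uint32_t frags_per_chunk)
{
	return 1 + chunk_index * (frags_per_chunk + 1) + frag_in_chunk;
}
```

With four fragments per chunk, chunk 0 occupies numbers 1..4, number 5 is free for reply data between the chunks, and chunk 1 starts at 6, so numbers from one chunk to the next are not consecutive.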
In contrast to the RDMA read chunk list, the protocol does not
provide an XDR position in the RDMA write chunks, since the RPC
client knows exactly where to insert the chunk's data from the
virtual address of the DDP (direct data placement) item. There is
no way to map a virtual address to an XDR position, so in order
to reassemble the XDR stream a two-pass approach is used. In the
first pass (visited = 0), all RDMA writes are inserted as
fragments, leaving a gap between each chunk. The dissector for
the upper layer is then called with a flag letting it know that
it is dealing with a reduced message, so all DDP-enabled
operations handle the opaque data as having only the size of the
opaque but not the data, reporting back the offset from the end
of the message. Once the upper layer dissector returns, this
layer has a list of DDP-eligible items' offsets, which are then
translated into XDR offsets; the RPC reply packet is broken into
fragments and inserted in the right places, as in the case of the
RDMA read chunk list. On the second pass (visited = 1), all
fragments have already been inserted into the reassembly table,
so it just needs to reassemble the whole message and then call
the upper layer dissector.
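The offset translation in the first pass can be sketched as follows; the function name is an illustrative assumption. The upper layer reports each reduced opaque as an offset measured from the end of the reduced message (it cannot know absolute XDR positions for write chunks), and once the reduced message length is known these map back to absolute XDR offsets, i.e. the points where the reply must be split into fragments:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: convert end-relative offsets reported by the upper layer
 * into absolute XDR offsets within the reduced RPC reply, which are
 * the split points where the write chunk data must be spliced in. */
static void
xdr_offsets_from_end(uint32_t msg_len, const uint32_t *from_end,
                     uint32_t *xdr_off, size_t n)
{
	for (size_t i = 0; i < n; i++)
		xdr_off[i] = msg_len - from_end[i];
}
```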
RFC 8267 specifies the upper layer bindings to RPC-over-RDMA
version 1 for NFS. Since RPC-over-RDMA version 1 specifies the
XDR position for the read chunks, only the write chunk
DDP-eligible items are handled in the upper layer, in this case
the NFS layer. These are the only procedures or operations
eligible for write chunks:
* The opaque data result in the NFS READ procedure or operation
* The pathname or linkdata result in the NFS READLINK procedure
or operation
Two functions are defined to signal the reduction and report the
DDP-eligible item's offset back from the upper layers.
Function rpcrdma_is_reduced() is used to signal the upper layer
that it is dealing with a reduced data message and thus should
skip the DDP-eligible item's opaque processing and just report
back the offset where the opaque data should be. This reporting
is done using the second function, rpcrdma_insert_offset().
Reassembly is done for InfiniBand only. Fragments are reassembled
using the packet sequence number (PSN) of each RDMA I/O fragment,
to make sure the message is reassembled correctly when fragments
are sent out of order. Also, a unique message id is used for each
message, so fragments are reassembled correctly when fragments of
different messages are sent in parallel.
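Restoring fragment order by PSN can be sketched with a plain sort; the structure is an illustrative assumption (the real dissector keys fragments into a reassembly table rather than sorting an array):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed minimal view of an RDMA I/O fragment */
struct rdma_frag {
	uint32_t      psn;   /* InfiniBand packet sequence number */
	const void   *data;
	size_t        len;
};

static int
cmp_psn(const void *a, const void *b)
{
	const struct rdma_frag *fa = a, *fb = b;
	return (fa->psn > fb->psn) - (fa->psn < fb->psn);
}

/* Fragments may arrive out of order; ordering by PSN restores the
 * on-the-wire byte order before reassembly. */
static void
sort_frags_by_psn(struct rdma_frag *frags, size_t nfrags)
{
	qsort(frags, nfrags, sizeof(*frags), cmp_psn);
}
```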
The reassembled message could be composed of multiple chunks,
each chunk in turn could be composed of multiple segments, each
segment could be composed of multiple requests, and each request
is composed of one or more fragments. Thus, in order to have all
fragments for each segment belong to the same message, a list of
segments is created and all segments belonging to the same
message are initialized with the same message id. These segments
are initialized and added to the list on the call side on
RDMA_MSG by calling process_rdma_lists().
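The message-id tagging can be sketched as below. The structure and function names are illustrative assumptions (in the dissector the counter would live in per-conversation state); the point is that every segment of one message carries the same freshly allocated id, so fragments of different messages arriving in parallel reassemble independently:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed minimal per-segment state */
struct rdma_segment_state {
	uint32_t msg_id;
};

static uint32_t next_msg_id;  /* per-conversation counter in practice */

/* On the call side (RDMA_MSG), tag all segments of the message's
 * chunk lists with the same new message id. */
static uint32_t
init_segments(struct rdma_segment_state *segs, size_t nsegs)
{
	uint32_t id = ++next_msg_id;
	for (size_t i = 0; i < nsegs; i++)
		segs[i].msg_id = id;
	return id;
}
```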
Bug: 13260
Change-Id: Icf57d7c46c3ba1de5d019265eb151a81d6019dfd
Reviewed-on: https://code.wireshark.org/review/24613
Petri-Dish: Anders Broman <a.broman58@gmail.com>
Tested-by: Petri Dish Buildbot
Reviewed-by: Anders Broman <a.broman58@gmail.com>
Diffstat (limited to 'epan/dissectors/packet-nfs.c')
-rw-r--r-- | epan/dissectors/packet-nfs.c | 67 |
1 files changed, 61 insertions, 6 deletions
diff --git a/epan/dissectors/packet-nfs.c b/epan/dissectors/packet-nfs.c
index ce738a5b51..f0fc7adcf7 100644
--- a/epan/dissectors/packet-nfs.c
+++ b/epan/dissectors/packet-nfs.c
@@ -26,6 +26,7 @@
 #include <epan/crc32-tvb.h>
 #include <wsutil/str_util.h>
 #include "packet-nfs.h"
+#include "packet-rpcrdma.h"
 
 void proto_register_nfs(void);
 void proto_reg_handoff_nfs(void);
@@ -953,6 +954,17 @@
 static wmem_tree_t *nfs_file_handles = NULL;
 static gboolean nfs_display_v4_tag = TRUE;
 static gboolean display_major_nfs4_ops = TRUE;
 
+/* Types of RDMA reduced opaque data */
+typedef enum {
+	R_UTF8STRING,
+	R_NFS2_PATH,
+	R_NFS3_PATH,
+	R_NFSDATA,
+} rdma_reduce_type_t;
+
+static int dissect_nfsdata_reduced(rdma_reduce_type_t rtype, tvbuff_t *tvb,
+		int offset, proto_tree *tree, int hf, const char **name);
+
 static int dissect_nfs4_stateid(tvbuff_t *tvb, int offset, proto_tree *tree, guint16 *hash);
 
 static void nfs_prompt(packet_info *pinfo _U_, gchar* result)
@@ -3039,7 +3051,7 @@ dissect_nfs2_readlink_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 	offset = dissect_nfs2_status(tvb, offset, tree, &status);
 	switch (status) {
 		case 0:
-			offset = dissect_path(tvb, offset, tree, hf_nfs2_readlink_data, &name);
+			offset = dissect_nfsdata_reduced(R_NFS2_PATH, tvb, offset, tree, hf_nfs2_readlink_data, &name);
 			col_append_fstr(pinfo->cinfo, COL_INFO, " Path: %s", name);
 			proto_item_append_text(tree, ", READLINK Reply Path: %s", name);
 			break;
@@ -3097,7 +3109,7 @@ dissect_nfs2_read_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 		case 0:
 			offset = dissect_nfs2_fattr(tvb, offset, tree, "attributes");
 			proto_item_append_text(tree, ", READ Reply");
-			offset = dissect_nfsdata(tvb, offset, tree, hf_nfs_data);
+			offset = dissect_nfsdata_reduced(R_NFSDATA, tvb, offset, tree, hf_nfs_data, NULL);
 			break;
 		default:
 			err = val_to_str_ext(status, &names_nfs2_stat_ext, "Unknown error: %u");
@@ -4808,7 +4820,7 @@ dissect_nfs3_readlink_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 		case 0:
 			offset = dissect_nfs3_post_op_attr(tvb, offset, pinfo, tree,
				"symlink_attributes");
-			offset = dissect_nfs3_path(tvb, offset, tree,
+			offset = dissect_nfsdata_reduced(R_NFS3_PATH, tvb, offset, tree,
 				hf_nfs2_readlink_data, &name);
 
 			col_append_fstr(pinfo->cinfo, COL_INFO, " Path: %s", name);
@@ -4878,7 +4890,7 @@ dissect_nfs3_read_reply(tvbuff_t *tvb, packet_info *pinfo _U_,
 				offset);
 			col_append_fstr(pinfo->cinfo, COL_INFO, " Len: %d", len);
 			proto_item_append_text(tree, ", READ Reply Len: %d", len);
-			offset = dissect_nfsdata(tvb, offset, tree, hf_nfs_data);
+			offset = dissect_nfsdata_reduced(R_NFSDATA, tvb, offset, tree, hf_nfs_data, NULL);
 			break;
 		default:
 			offset = dissect_nfs3_post_op_attr(tvb, offset, pinfo, tree,
@@ -6157,6 +6169,49 @@ dissect_nfs_utf8string(tvbuff_t *tvb, int offset,
 }
 
+/*
+ * When using RPC-over-RDMA, certain opaque data are eligible for DDP
+ * (direct data placement), so these must be reduced by sending just
+ * the opaque length with the rest of the NFS packet and the opaque
+ * data is sent separately using RDMA (RFC 8267).
+ */
+static int
+dissect_nfsdata_reduced(rdma_reduce_type_t rtype, tvbuff_t *tvb, int offset,
+		proto_tree *tree, int hf, const char **name)
+{
+	if (rpcrdma_is_reduced()) {
+		/*
+		 * The opaque data is reduced so just increment the offset
+		 * since there is no actual data yet.
+		 */
+		offset += 4;
+		/* Add offset (from the end) where the opaque data should be */
+		rpcrdma_insert_offset(tvb_reported_length_remaining(tvb, offset));
+		if (name) {
+			/* Return non-NULL string */
+			*name = "";
+		}
+	} else {
+		/* No data reduction, dissect the opaque data */
+		switch (rtype) {
+			case R_UTF8STRING:
+				offset = dissect_nfs_utf8string(tvb, offset, tree, hf, name);
+				break;
+			case R_NFS2_PATH:
+				offset = dissect_path(tvb, offset, tree, hf, name);
+				break;
+			case R_NFS3_PATH:
+				offset = dissect_nfs3_path(tvb, offset, tree, hf, name);
+				break;
+			case R_NFSDATA:
+				offset = dissect_nfsdata(tvb, offset, tree, hf);
+				break;
+		}
+	}
+	return offset;
+}
+
+
 static int
 dissect_nfs4_deviceid(tvbuff_t *tvb, int offset, proto_tree *tree)
 {
@@ -10530,7 +10585,7 @@ dissect_nfs4_response_op(tvbuff_t *tvb, int offset, packet_info *pinfo, proto_tr
 		case NFS4_OP_READ:
 			offset = dissect_rpc_uint32(tvb, newftree, hf_nfs4_eof, offset);
 			dissect_rpc_uint32(tvb, newftree, hf_nfs4_read_data_length, offset); /* don't change offset */
-			offset = dissect_nfsdata(tvb, offset, newftree, hf_nfs_data);
+			offset = dissect_nfsdata_reduced(R_NFSDATA, tvb, offset, newftree, hf_nfs_data, NULL);
 			break;
 
 		case NFS4_OP_READDIR:
@@ -10539,7 +10594,7 @@ dissect_nfs4_response_op(tvbuff_t *tvb, int offset, packet_info *pinfo, proto_tr
 			break;
 
 		case NFS4_OP_READLINK:
-			offset = dissect_nfs_utf8string(tvb, offset, newftree, hf_nfs4_linktext, NULL);
+			offset = dissect_nfsdata_reduced(R_UTF8STRING, tvb, offset, newftree, hf_nfs4_linktext, NULL);
 			break;
 
 		case NFS4_OP_RECLAIM_COMPLETE: